Introduction

About this handbook

About this handbook

Objective

To make an open-access digital R reference book which is catered to epidemiologists and public health practitioners, usable offline, and addresses common epidemiological tasks via clear text explanations, step-by-step instructions, and best practice R code examples

The problem:

Most online R help resources are not task-centered nor epidemiology-focused. Epis learning or new to R often must Google and skim dozens of forum pages to complete common data manipulation and visualization epi tasks. Furthermore, field epidemiologists often work in low internet-connectivity environments and have limited technical support.

How to read this handbook:

  • This handbook in an HTML file. It is not online, you are only using your web browser to view this local file.

  • This handbook is best viewed with Google Chrome. Some functions may not work in other browsers.

  • Use tabs on the right to hide/view code. See ‘Copy to clipboard’ icon in the upper-right of each code section

Version
The latest version of this handbook can be found at this github repository.

Style

  • The handbook gives one recommended way of completing a task, and offers other methods/packages as appropriate.
  • The handbook generally uses tidyverse R codying style. Read more here (TODO).
  • Package names are written in bold (e.g. dplyr) and functions are like this: mutate().
  • Code in this handbook sometimes explicity names the package of a function (e.g. dplyr::mutate()), so it is clear to the reader which package is being used.

Types of notes

FOR EXAMPLE: This is a boxed example

NOTE: This is a note

TIP: This is a tip.

CAUTION: This is a cautionary note.

DANGER: This is a warning.

Datasets used

Datasets used

Here the datasets used in this handbook will be described and will be “downloadable” via link (the files will be stored within the HTML, so available offline as well)

  • Linelist (…) Linelist for the 2013 (first wave) H7N9 outbreak in China (source)
  • Aggregated case counts (…)
  • GIS coordinates (…)
  • GIS shapefile (…)
  • modeling dataset? (…)

Contributors

Contributors

Editor-in-Chief: Neale Batra ()

Editorial core team:

Authors:

Reviewers:

Advisors

Data contributors:
outbreaks package

Some of this material comes from the R4Epis website, which was also made by some of the same people…

RECON packages

Photo credits (logo): CDC Public Image gallery; R Graph Gallery

R Basics

Overview

This section is not meant as a comprehensive “how to learn R” tutorial. However, it does cover some of the fundamentals that can be good to reference or refresh.

More comprehensive tutorials are available online:
* Here
* and Here
* and even Here
* Oh yea and Here too (there’s a lot of them)

Why use R?

Why use R?

  • Reproducibility
  • Fewer errors
  • Collaboration
  • Free

Installation

Installation

How to install R

How to install R Studio

Other things you may need to install:
* TinyTeX
* Pandoc
* RTools

RStudio

RStudio

RStudio Orientation

First, open RStudio. As their icons can look very similar, be sure you are opening RStudio and not R.

For RStudio to function you must also have R installed on the computer (see this section for installation instructions).

RStudio is an interface (GUI) for easier use of R. You can think of R as being the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (with seats, accessories, etc.) that helps you actually use the engine to move forward!

By default RStudio displays four rectangle panes.

TIP: If your RStudio displays only one left pane it is because you have no scripts open yet.

The R Console Pane

The R Console, by default the left or lower-left pane in R Studio, is the home of the R “engine”. This is where the commands are actually run and non-graphic outputs and error/warning messages appear. You can directly enter and run commands in the R Console, but realize that these commands are not saved as they are when running commands from a script.

If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.

The Source Pane
This pane, by default in the upper-left, is space to edit and run your scripts. This pane can also display datasets (data frames) for viewing.

For Stata users, this pane is similar to your Do-file and Data Editor windows.

The Environment Pane
This pane, by default the upper-right, is most often used to see brief summaries of objects in the R Environment in the current session. These objects could include imported, modified, or created datasets, parameters you have defined (e.g. a specific epi week for the analysis), or vectors or lists you have defined during analysis (e.g. names of regions). Click on the arrow next to a dataframe name to see its variables.

In Stata, this is most similar to Variables Manager window.

Plots, Packages, and Help Pane
The lower-right pane includes several tabs including plots (display of graphics including maps), help, a file library, and available R packages (including installation/update options).

This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.

RStudio settings

Change RStudio settings and appearance in the Tools drop-down menu, by selecting Global Options

Scripts

Scripts

Why use a script

R scripts (vs. typing in the console)
* Advantages (reproducability) * General sequence (into, load packages, load data, clean data, conduct analysis, save results) * Commenting

Rmarkdown

R notebooks

RShiny

Working directory

Working directory

These tabs cover how to use R working directories, and how this changes when you are working within an R project. The working directory is the root file location used by R for your work.
By default, it will save new files and outputs to this location, and will look for files to import (e.g. datasets) here as well.

NOTE: If using an [R project](#rproject), the working directory will default to the R project root folder **IF** you open RStudio by clicking open the R project (the file with .rproj extension))

Set by Command

Use the command setwd() with the filepath in quotations, for example: setwd("C:/Documents/R Files")

CAUTION: If using an RMarkdown script be aware of the following:

In an R Markdown script, the default working directory is the folder the Rmarkdown file (.Rmd) is saved to. If you want to change this, you can use setwd() as above, but know the change will only apply to that specific code chunk.

To change the working directory for all code chunks in an R markdown, edit the setup chunk to add the root.dir = parameter, such as below:

knitr::opts_knit$set(root.dir = 'desired/filepath/here')

Set Manually

Setting your working directory manually (point-and-click)

From RStudio click: Session / Set Working Directory / Choose Directory (you will have to do this each time you open RStudio)

In an R project

How things change in an R project

Objects

Objects

Everything in R is an object. These sections will explain:

  • How to create objects (<-)
  • Types of objects (e.g. data frames, vectors..)
  • How to access subparts of objects (e.g. variables in a dataset)
  • Classes of objects (e.g. numeric, character, factor)

Everything is an object

Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - are objects which are assigned a name and can be referenced in later commands.

An object exists when you have assigned it a value (see the assignment section below). When it is assigned a value, the object appears in the Environment (see the upper right pane of RStudio). It can then be operated upon, manipulated, changed, and re-defined.

Creating objects (<-)

Create objects by assigning them a value with the <- operator.
You can think of the assignment operator <- as the words “is defined as”. Assignment commands generally follow a standard order:

object_name <- value (or process/calculation that produce a value)

EXAMPLE: You may want to record the current epidemiological reporting week as an object for reference in later code. In this example, the object reporting_week is created when it is assigned the character value "2018-W10" (the quote marks make these a character value).
The object reporting_week will then appear in the RStudio Environment pane (upper-right) and can be referenced in later commands.

See the R commands and their output in the boxes below.

reporting_week <- "2018-W10"   # this command creates the object reporting_week by assigning it a value
reporting_week                 # this command prints the current value of reporting_week object in the console
## [1] "2018-W10"

NOTE: Note the [1] in the R console output is simply indicating that you are viewing the first item of the output

CAUTION: An object’s value can be over-written at any time by running an assignment command to re-define its value. Thus, the order of the commands run is very important.

The following command will re-define the value of reporting_week:

reporting_week <- "2018-W51"   # assigns a NEW value to the object reporting_week
reporting_week                 # prints the current value of reporting_week in the console
## [1] "2018-W51"

Datasets are also objects and must be assigned names when they are imported.

In the code below, the object linelist is created and assigned the value of a CSV file imported with the rio package.

# linelist_raw is created and assigned the value of the imported CSV file
linelist <- rio::import("my_linelist.csv")

You can read more about importing and exporting datasets with the section on importing data.

CAUTION: A quick note on naming of objects:

  • Object names must not contain spaces, but you should use underscore (_) or a period (.) instead of a space.
  • Object names are case-sensitive (meaning that Dataset_A is different from dataset_A).
  • Object names must begin with a letter (cannot begin with a number like 1, 2 or 3).

Object Structure

Objects can be a single piece of data (e.g. my_number <- 24), or they can consist of structured data.

The graphic below, sourced from this online R tutorial shows some common data structures and their names. Not included in this image is spatial data, which is discussed in the GIS section.

In epidemiology (and particularly field epidemiology), you will most commonly encounter data frames and vectors:

Common structure Explanation Example from templates
Vectors A container for a sequence of singular objects, all of the same class (e.g. numeric, character). “Variables” (columns) in data frames are vectors (e.g. the variable age_years).
Data Frames Vectors (e.g. columns) that are bound together that all have the same number of rows. linelist_raw and linelist_cleaned are both data frames.

Note that to create a vector that “stands alone”, or is not part of a data frame (such as a list of location names), the function c() is often used:
list_of_names <- c("Ruhengeri", "Gisenyi", "Kigali", "Butare")

Object Classes

All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:

Class Explanation Examples
Character These are text/words/sentences “within quotation marks”. Math cannot be done on these objects. “Character objects are in quotation marks”
Numeric These are numbers and can include decimals. If within quotation marks the will be considered character. 23.1 or 14
Integer Numbers that are whole only (no decimals) -5, 14, or 2000
Factor These are vectors that have a specified order or hierarchy of values Variable msf_involvement with ordered values N, S, SUB, and U.
Date Once R is told that certain data are Dates, these data can be manipulated and displayed in special ways. See the page on Dates for more information. 2018-04-12 or 15/3/1954 or Wed 4 Jan 1980
Logical Values must be one of the two special values TRUE or FALSE (note these are not “TRUE” and “FALSE” in quotation marks) TRUE or FALSE
data.frame A data frame is how R stores a typical dataset. It consists of vectors (columns) of data bound together, that all have the same number of observations (rows). The example AJS dataset named linelist_raw contains 68 variables with 300 observations (rows) each.

You can test the class of an object by feeding it to the function class(). Note: you can reference a specific column within a dataset using the $ notation to separate the name of the dataset and the name of the column.

class(linelist$age)     # class should be numeric
## [1] "numeric"

class(linelist$gender)  # class should be character
## [1] "character"

Often, you will need to convert objects or variables to another class.

Function Action
as.character() Converts to character class
as.numeric() Converts to numeric class
as.integer() Converts to integer class
as.Date() Converts to Date class - Note: see section on dates for details
as.factor() Converts to factor - Note: re-defining order of value levels requires extra arguments

Here is more online material on classes and data structures in R.

Variables ($)

Vectors within a data frame (variables in a dataset) can be called, referenced, or created using the $ symbol. The $ symbol connects the name of the column to the name of its data frame. The $ symbol must be used, otherwise R will not know where to look for or create the column.

# Retrieve the length of the vector age_years
length(linelist$age) # (age is a variable in the linelist data frame)

By typing the name of the data frame followed by $ you will also see a list of all variables in the data frame. You can scroll through them using your arrow key, select one with your Enter key, and avoid spelling mistakes!

knitr::include_graphics(here::here("images", "Calling_Names.gif"))

ADVANCED TIP: Some more complex objects (e.g. an epicontacts object may have multiple levels which can be accessed through multiple dollar signs. For example epicontacts$linelist$date_onset) .

Access/View with brackets ([])

You may need to view parts of objects, which is often done using the square brackets [ ].

To view specific rows and columns of a dataset, you can do this using the syntax dataframe[rows, columns]:

# View a specific row (2) from dataset, with all columns
linelist[2,]

# View all rows, but just one column
linelist[, "date_onset"]

# View values from row 2 and columns 5 through 10
linelist[2, 5:10] 

# View values from row 2 and columns 5 through 10 and 18
linelist[2, c(5:10, 18)] 

# View rows 2 through 20, and specific columns
linelist[2:20, c("date_onset", "outcome", "age")]

# View rows and columns based on criteria
# *** Note the dataframe must still be names in the criteria!
linelist[linelist$age > 25 , c("date_onset", "date_birth", "age")]

# Use View() to see the outputs in the RStudio Viewer pane (easier to read) 
# *** Note the capital "V" in View() function
View(linelist[2:20, "date_onset"])

# Save as a new object
new_table <- linelist[2:20, c("date_onset")] 

The square brackets also work to call specific parts of summary() function:

# All of the summary
summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   47.25   60.00   56.91   72.00   91.00       2

#Just one part
summary(linelist$age)[2]
## 1st Qu. 
##   47.25

Functions

Functions

This section on functions explains:
* What a function is and how they work
* What arguments are
* What packages are
* How to get help understanding a function

Simple Functions

A function is like a machine that receives inputs, does some action with those inputs, and produces an output.
What the output is depends on the function.

Functions typically operate upon some object placed within the function’s parentheses. For example, the function sqrt() calculates the square root of a number:

sqrt(49)
## [1] 7

Functions can also be applied to variables in a dataset. For example, when the function summary() is applied to the numeric variable age in the dataset linelist (what’s the $ symbol?), the output is a summary of the variable’s numeric and missing values.

summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   47.25   60.00   56.91   72.00   91.00       2

NOTE: Behind the scenes, a function represents complex additional code that has been wrapped up for the user into one easy command.

Functions with Multiple Arguments

Functions often ask for several inputs, called arguments, located within the parentheses of the function, usually separated by commas.

  • Some arguments are required for the function to work correctly, others are optional.
  • Optional arguments have default settings if they are not specified.
  • Arguments can take character, numeric, logical (TRUE/FALSE), and other inputs.

For example, this age_pyramid() command produces an age pyramid graphic based on defined age groups and a binary split variable, such as gender. The function is given three arguments within the parentheses, separated by commas. The values supplied to the arguments establish linelist as the data frame to use, age_group as the variable to count, and gender as the binary variable to use for splitting the pyramid by color.

NOTE: For this example, in the background we have created a new variable called “age_group”. To learn how to create new variable see that section of this handbook

# Creates an age pyramid by specifying the dataframe, age group variable, and a variable to split the pyramid
apyramid::age_pyramid(data = linelist, age_group = "age_group", split_by = "gender")

The first half of an argument assignment (e.g. data =) does not need to be specified if the arguments are written in a specific order (specified in the function’s documentation). The below code produces the exact same pyramid as above, because the function expects the argument order: data frame, age_group variable, split_by variable.

# This command will produce the exact same graphic as above
apyramid::age_pyramid(linelist, "age_group", "gender")

A more complex age_pyramid() command might include the optional arguments to:

  • Show proportions instead of counts (set proportional = TRUE when the default is FALSE)
  • Specify the two colors to use (pal = is short for “palette” and is supplied with a vector of two color names. See the objects page for how the function c() makes a vector)

NOTE: For arguments specified with an equals symbol (e.g. coltotals = ...), their order among the arguments is not important (must still be within the parentheses and separated by commas).

age_pyramid(linelist, "age_group", "gender", proportional = TRUE, pal = c("orange", "purple"))

Packages

Packages contain functions.

On installation, R contains “base” functions that perform common elementary tasks. But many R users create specialized functions, which are verified by the R community and which you can download as a package for your own use.

One of the more challenging aspects of R is that there are often many functions or packages to choose from to complete a given task.

Functions are contained within packages which can be downloaded (“installed”) to your computer from the internet. Once a package is downloaded, you access its functions by loading the package with the library() command at the beginning of each R session.

# this loads the package "tidyverse" for use in the current R session
library(tidyverse)

NOTE: While you only have to install a package once, you must load it at the beginning of every R session using library() command, or an alternative like pacman’s p_load() function.

Think of R as your personal library: When you download a package your library gains a book of functions, but each time you want to use a function in that book, you must borrow that book from your library.

For clarity in this handbook, functions are usually preceeded by the name of their package using the :: symbol in the following way:

package_name::function_name()

Once a package is loaded for a session, this explicit style is not necessary. One can just use function_name(). However giving the package name is useful when a function name is common and may exist in multiple packages (e.g. plot()).
Using the package name will also load the package if it is not already loaded.

# This command uses the package "rio" and its function "import()" to import a dataset
linelist <- rio::import("linelist.xlsx", which = "Sheet1")

Dependencies
Packages often depend on other packages, and these are called “dependencies”. When a package is installed from CRAN, it will typically also install its dependenices.

Function Help

To read more about a function, you can try searching online for resources OR search in the Help tab of the lower-right RStudio pane.

Piping

Piping (%>%)

Two general approaches to R coding are:

  1. Tidyverse - piping an object from function to function
  2. defining intermediate objects (and older method, still worth knowing about)

Piping

Simply explained, the pipe operator (%>%) passes an intermediate output from one function to the next.
You can think of it as saying “then”. Many functions can be linked together with %>%.

  • Piping emphasizes a sequence of actions, not the object the actions are being performed on

  • Best when a sequence of actions must be performed on one object

  • from magrittr. Included in dplyr and tidyverse

  • Makes code more clean and easier to read, intuitive

  • express a sequence of operations

  • the object is altered and then passed on to the next function

Example:

# A fake example of how to bake a care using piping syntax

cake <- flour %>%       # to define cake, start with flour, and then...
  left_join(eggs) %>%   # add eggs
  left_join(oil) %>%    # add oil
  left_join(water) %>%  # add water
  mix_together(utensil = spoon, minutes = 2) %>%                # mix together
  bake(degrees = 350, system = "fahrenheit", minutes = 35) %>%  # bake
  let_cool()            # let it cool down

https://cfss.uchicago.edu/notes/pipes/#:~:text=Pipes%20are%20an%20extremely%20useful,code%20and%20combine%20multiple%20operations.

Piping is not a base function. To use piping, the dplyr package must be installed and loaded. Near the top of every template script is a code chunk that installs and loads the necessary packages, including dplyr. You can read more about piping in the documentation.

CAUTION: Remember that even when using piping to link functions, if the assignment operator (<-) is present, the object to the left will still be over-written (re-defined) by the right side.

TODO %<>% shortcut for re-defining the object and piping

Intermediate objects

Better if:
* You need to manipulate multiple objects
* There are intermediate steps that are meaningful and deserve separate object names

as changes are made - still handy to know

Risks: creating new objects for each step - lots of objects. If you use the wrong one you might not know. naming can be confusing, errors not easily detectable

either name each intermediate object, or overwrite the original, or combine all the functions together. all come with risks

https://style.tidyverse.org/pipes.html

# a fake example of how to bake a cake using this method (defining intermediate objects)
batter_1 <- left_join(flour, eggs)
batter_2 <- left_join(batter_1, oil)
batter_3 <- left_join(batter_2, water)

batter_4 <- mix_together(object = batter_3, utensil = spoon, minutes = 2)

cake <- bake(batter_4, degrees = 350, system = "fahrenheit", minutes = 35)

cake <- let_cool(cake)

Combine all functions together - also difficult to read

# an example of combining/nesting mutliple functions together - difficult to read
cake <- let_cool(bake(mix_together(batter_3, utensil = spoon, minutes = 2), degrees = 350, system = "fahrenheit", minutes = 35))

Key operators and functions

Key operators and functions

This section details operators in R, such as:
* Relational operators (less than, equal too..)
* Logical operators (and, or…)
* Handling missing values
* Mathematical operators and functions (+/-, >, sum(), median(), …)
* The %in% operator

Relational and logical operators

Relational operators compare values and are often used when defining new variables and subsets of datasets. Here are the common relational operators in R:

Function Operator Example Example Result
Equal to == "A" == "a" FALSE (because R is case sensitive) Note that == (double equals) is different from = (single equals), which acts like the assignment operator <-
Not equal to != 2 != 0 TRUE
Greater than > 4 > 2 TRUE
Less than < 4 < 2 FALSE
Greater than or equal to >= 6 >= 4 TRUE
Less than or equal to <= 6 <= 4 FALSE
Value is missing is.na() is.na(7) FALSE (see section on missing values)
Value is not missing !is.na() !is.na(7) TRUE

Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.

Function Operator
AND &
OR | (vertical bar)
Parentheses ( ) Used to group criteria together and clarify order

For example, below, we have a linelist with two variables we want to use to create our case definition, hep_e_rdt, a test result and other_cases_in_hh, which will tell us if there are other cases in the household. The command below uses the function case_when() to create the new variable case_def such that:

linelist_cleaned <- linelist_cleaned %>%
  mutate(case_def = case_when(
    is.na(hep_e_rdt) & is.na(other_cases_in_hh)           ~ NA_character_,
    hep_e_rdt == "Positive"                               ~ "Confirmed",
    hep_e_rdt != "Positive" & other_cases_in_hh == "Yes"  ~ "Probable",
    TRUE                                                  ~ "Suspected"
  ))
Criteria in example above Resulting value in new variable “case_def”
If the value for variables hep_e_rdt and other_cases_in_hh are missing NA (missing)
If the value in hep_e_rdt is “Positive” “Confirmed”
If the value in hep_e_rdt is NOT “Positive” AND the value in other_cases_in_hh is “Yes” “Probable”
If one of the above criteria are not met “Suspected”

{{% notice tip %}} Note that R is case-sensitive, so “Positive” is different than “positive”… {{% /notice %}}

Missing Values

In R, missing values are represented by the special value NA (capital letters N and A - not in quotation marks). If you import data that records missing data in another way (e.g. 99, “Missing”, or .), you may want to re-code those values to NA.

To test whether a value is NA, use the special function is.na(), which returns TRUE or FALSE.

rdt_result <- c("Positive", "Suspected", "Positive", NA)   # two positive cases, one suspected, and one unknown
is.na(rdt_result)  # Tests whether the value of rdt_result is NA
## [1] FALSE FALSE FALSE  TRUE

To DO: SECTION ON OTHER NA TYPES: NA_character, NA_real etc. SECTION ON NULL

Mathematics and statistics

All the operators and functions in this page is automatically available using base R.

Mathematical operators

These are often used to perform addition, division, to create new columns, etc. Below are common mathematical operators in R. Whether you put spaces around the operators is not important.

Objective Example in R
addition 2 + 3
subtraction 2 - 3
multiplication 2 * 3
division 30 / 5
exponent 2^3
order of operations ( )

Mathematical functions

Objective Function
rounding round(x, digits = n)
ceiling (round up) ceiling(x)
floor (round down) floor(x)
absolute value abs(x)
square root sqrt(x)
exponent exponent(x)
natural logarithm log(x)

Statistical functions:

CAUTION: The functions below will by default include missing values in calculations. Missing values will result in an output of NA, unless the argument na.rm=TRUE is specified

Objective Function
mean (average) mean(x, na.rm=T)
median median(x, na.rm=T)
standard deviation sd(x, na.rm=T)
quantiles* quantile(x, probs)
sum sum(x, na.rm=T)
minimum value min(x, na.rm=T)
maximum value max(x, na.rm=T)
range of numeric values range(x, na.rm=T)
summmary** summary(x)

*quantile(): x is the numeric vector to examine, and probs is a numeric vector with probabilities within 0 and 1.0, e.g c(0.5, 0.8, 0.85)

**summary(): gives a summary on a numeric vector including mean, median, and common percentiles

Other useful functions:

Objective Function Example
create a sequence seq(from, to, by) seq(1, 10, 2)
repeat x, n times rep(x, ntimes) rep(1:3, 2) or rep(c("a", "b", "c"), 3)
subdivide a numeric vector cut(x, n) cut(linelist$age, 5)

%in%

Loading Packages

Loading Packages

This section describes the several ways to install a package:
* Via the online package repository (CRAN)
* From a ZIP file
* From Github

CRAN

ZIP files

Github

Errors & Warnings

Errors & Warnings

This section explains:
* General syntax for writing R code
* Code assists
* the difference between errors and warnings

Common errors and warnings and their solutions can be found in X section (TODO).

General Syntax

A few things to remember when writing commands in R, to avoid errors and warnings:

  • Always close parentheses - tip: count the number of opening “(” and closing parentheses “)” for each code chunk
  • Avoid spaces in column and object names. Use underscore ( _ ) or periods ( . ) instead
  • Keep track of and remember to separate a function’s arguments with commas
  • R is case-sensitive, meaning Variable_A is different from variable_A

Code assists

Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parentheses, RStudio will raise a flag on that line, on the right side of the script, to warn you.

(/images/Warnings_and_Errors.png)

Errors and Warnings

When a command is run, the R Console may show you warning or error messages in red text.

  • A warning means that R has completed your command, but had to take additional steps or produced unusual output that you should be aware of.

  • An error means that R was not able to complete your command.

Look for clues:

  • The error/warning message will often include a line number for the problem.

  • If an object “is unknown” or “not found”, perhaps you spelled it incorrectly, forgot to call a package with library(), or forgot to re-run your script after making changes.

If all else fails, copy the error message into Google along with some key terms - chances are that someone else has worked through this already!

Recommended training

Importing data

Overview

Introduction to importing data

Packages

The key package we recommend for importing data is: rio. rio offers the useful function import() which can import many types of files into R.

The alternative to using rio would be to use functions from several other packages that are specific to a type of file (e.g. read.csv(), read.xlsx(), etc.). While these alternatives can be difficult to remember, always using rio::import() is relatively easy.

Optionally, the package here can be used in conjunction with rio. It locates files on your computer via relative pathways, usually within the context of an R project. Relative pathways are relative from a designated folder location, so that pathways listed in R code will not break when the script is run on a different computer.

This code chunk shows the loading of packages for importing data.

# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(rio, here)

import()

When you import a dataset, you are doing the following:

  1. Creating a new, named data frame object in your R environment
  2. Defining the new object as the imported dataset

The function import() from the package rio makes it easy to import many types of data files.

# An example:
#############
library(rio)                                                     # ensure package rio is loaded for use

# New object is defined as the imported data
my_csv_data <- import("linelist.csv")                            # importing a csv file

my_Excel_data <- import("observations.xlsx", which = "February") # import an Excel file

import() uses the file’s extension (e.g. .xlsx, .csv, .dta, etc.) to appropriately import the file. Any optional arguments specific to the filetype can be supplied as well.

You can read more about the rio package in this online vignette

CAUTION: In the example above, the datasets are assumed to be located in the working directory, or the same folder as the script.

TO DO

import a specific range of cells skip rows, in excel and csv rio table of functions used for import/export/convert https://cran.r-project.org/web/packages/rio/vignettes/rio.html other useful function to know as backup EpiInfo SAS STATA Google Spreadsheets R files

Import from filepath

A filepath can be provided in full (as below) or as a relative filepath (see next tab). Providing a full filepath can be fast and may be the best if referencing files from a shared/network drive).

The function import() (from the package rio) accepts a filepath in quotes. A few things to note:

  • Slashes must be forward slashes, as in the code shown. This is NOT the default for Windows filepaths.
  • Filepaths that begin with double slashes (e.g. “//…”) will likely not be recognized by R and will produce an error. Consider moving these files to a “named” or “lettered” drive that begins with a letter (e.g. “J:” or “C:”). See the section on using Network Drive for more details on this issue.
# A demonstration showing how to import a specific Excel sheet
my_data <- rio::import("C:/Users/Neale/Documents/my_excel_file.xlsx")

Import Excel sheet

If importing a specific sheet from an Excel file, include the sheet name in the which = argument of import(). For example:

# A demonstration showing how to import a specific Excel sheet
my_data <- rio::import("my_excel_file.xlsx", which = "Sheetname")

If using the here() method to provide a relative pathway to import(), you can still indicate a specific sheet by adding the which = argument after the closing parenthese of the here() function.

# Demonstration: importing a specific Excel sheet when using relative pathways with the 'here' package
linelist_raw <- import(here("data", "linelists", "linelist.xlsx"), which = "Sheet1")`  

Select file manually

You can import data manually via one of these methods:

  • Environment RStudio Pane, click “Import Dataset”, and select the type of data
  • Click File / Import Dataset / (select the type of data)
  • To hard-code manual selection, use the base R command file.choose() (leaving the parentheses empty) to trigger appearance of a pop-up window that allows the user to manually select the file from their computer. For example:
# A demonstration showing manual selection of a file. When this command is run, a POP-UP window should appear. 
# The filepath of the selected file will be supplied to the import() command.

my_data <- rio::import(file.choose())

TIP: The pop-up window may appear BEHIND your RStudio window.

Relative filepaths (here())

Relative filepaths differ from static filepaths in that they are relative from a R project root directory. For example:

  • A static filepath: import("C:/Users/nsbatra/My Documents/R files/epiproject/data/linelists/ebola_linelist.xlsx")
    • Specific fixed path
    • Useful if multiple users are running a script hosted on a network drive
  • A relative filepath: import(here("data", "linelists", "ebola_linelist.xlsx"))
    • Path is given in relation to a root directory (typically the root folder of an R project)
    • Best if working within an R project, or planning to zip and share entire project with others

The package here and it’s function here() facilitate relative pathways.

here() works best within R projects. When the here package is first loaded (library(here)), it automatically considers the top-level folder of your R project as “here” - a benchmark for all other files in the project.

Thus, in your script, if you want to import or reference a file saved in your R project’s folders, you use the function here() to tell R where the file is in relation to that benchmark.

If you are unsure where “here” is set to, run the function here() with the empty brackets:

# This command tells you the folder path that "here" is set to 
here::here()

Below is an example of importing the file “fluH7N9_China_2013.csv” which is located in the benchmark “here” folder. All you have to do is provide the name of the file in quotes (with the appropriate ending).

linelist <- import(here("fluH7N9_China_2013.csv"))

If the file is within a subfolder - let’s say a “data” folder - write these folder names in quotes, separated by commas, as below:

linelist <- import(here("data", "fluH7N9_China_2013.csv"))

Using the here() command produces a character filepath, which can then processed by the import() function.

# the filepath
here("data", "fluH7N9_China_2013.csv")

# the filepath is given to the import() function
linelist <- import(here("data", "fluH7N9_China_2013.csv"))

NOTE: You can still import a specific sheet of an excel file as noted in the Excel tab. The here() command only supplies the filepath.

Manual data entry

Entry by columns

Since a data frame is a combination of vertical vectors (columns), R by default expects manual entry of data to also be in vertical vectors (columns).

# define each vector (vertical column) separately, each with its own name
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death     <- c(1, 0, 1, 0)

CAUTION: All vectors must be the same length (same number of values).

The vectors can then be bound together using the function data.frame():

# combine the columns into a data frame, by referencing the vector names
manual_entry_cols <- data.frame(PatientID, Treatment, Death)

And now we display the new dataset:

# display the new dataset
DT::datatable(manual_entry_cols)

Entry by rows

Use the tribble function from the tibble package from the tidverse (onlinetibble reference).

Note how column headers start with a tilde (~). Also note that each column must contain only one class of data (character, numeric, etc.).
You can use tabs, spacing, and new rows to make the data entry more intuitive and readable. For example:

# create the dataset manually by row
manual_entry_rows <- tibble::tribble(
                        ~colA, ~colB,
                        "a",   1,
                        "b",   2,
                        "c",   3
                      )

And now we display the new dataset:

# display the new dataset
DT::datatable(manual_entry_rows)

OR ADD ROWS dplyr TO DO

Pasting from clipboard

If you copy data from elsewhere and have it on your clipboard, you can try the following command to convert those data into an R data frame:

manual_entry_clipboard <- read.table(file = "clipboard",
                                     sep = "t",           # separator could be tab, or commas, etc.
                                     header=TRUE)         # if there is a header row

Working with Dates

Overview

  • It is important to make R recognize when a variable contains dates.
  • Dates are an object class and can be tricky to work with.
  • Here we present several ways to convert date variables to Date class.

Packages

The following packages are recommended for working with dates:

# Checks if package is installed, installs if necessary, and loads package for current session

pacman::p_load(aweek,      # flexibly converts dates to weeks, and vis-versa
               lubridate,  # for conversions to months, years, etc.
               linelist,   # function to guess messy dates
               ISOweek)    # another option for creating weeks

as.Date()

The standard, base R function to convert an object or variable to class Date is as.Date() (note capitalization).

as.Date() requires that the user specify the existing* format of the date*, so it can understand, convert, and store each element (day, month, year, etc.) correctly. Read more online about as.Date().

If used on a variable, as.Date() therefore requires that all the character date values be in the same format before converting. If your data are messy, try cleaning them or consider using guess_dates() from the linelist package.

It can be easiest to first convert the variable to character class, and then convert to date class:

  1. Turn the variable into character values using the function as.character()
linelist_cleaned$date_of_onset <- as.character(linelist_cleaned$date_of_onset)
  1. Convert the variable from character values into date values, using the function as.Date()
    (note the capital “D”)
  • Within the as.Date() function, you must use the format= argument to tell R the current format of the date components - which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R’s standard date formats (YYYY-MM-DD or YYYY/MM/DD) the format= argument is not necessary.

    • The codes are:
      %d = Day # (of the month e.g. 16, 17, 18…)
      %a = abbreviated weekday (Mon, Tues, Wed, etc.)
      %A = full weekday (Monday, Tuesday, etc.)
      %m = # of month (e.g. 01, 02, 03, 04)
      %b = abbreviated month (Jan, Feb, etc.)
      %B = Full Month (January, February, etc.)
      %y = 2-digit year (e.g. 89)
      %Y = 4-digit year (e.g. 1989)

For example, if your character dates are in the format DD/MM/YYYY, like “24/04/1968”, then your command to turn the values into dates will be as below. Putting the format in quotation marks is necessary.

linelist_cleaned$date_of_onset <- as.Date(linelist_cleaned$date_of_onset, format = "%d/%m/%Y")

TIP: The format= argument is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.

TIP:Be sure that in the format= argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.

The as.character() and as.Date() commands can optionally be combined as:

linelist_cleaned$date_of_onset <- as.Date(as.character(linelist_cleaned$date_of_onset), format = "%d/%m/%Y")

If using piping and the tidyverse, the above command might look like this:

linelist_cleaned <- linelist_cleaned %>%
  mutate(date_of_onset = as.character(date_of_onset),
         date_of_onset = as.Date(date_of_onset, format = "%d/%m/%Y"))

Once complete, you can run a command to verify the class of the variable

# Check the class of the variable
class(linelist_cleaned$date_of_onset)  

Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.

lubridate

This is a section on using lubridate (Henry)

guess_dates()

The function guess_dates() attempts to read a “messy” date variable containing dates in many different formats and convert the dates to a standard format. You can read more online about guess_dates(), which is in the linelist package.

For example: guess_dates would see the following dates “03 Jan 2018”, “07/03/1982”, and “08/20/85” and convert them in the class Date to: 2018-01-03, 1982-03-07, and 1985-08-20.

linelist::guess_dates(c("03 Jan 2018", "07/03/1982", "08/20/85")) # guess_dates() not yet available on CRAN for R 4.0.2
                                                                  # try install via devtools::install_github("reconhub/linelist")

Some optional arguments for guess_dates() that you might include are:

  • error_tolerance - The proportion of entries which cannot be identified as dates to be tolerated (defaults to 0.1 or 10%)
  • last_date - the last valid date (defaults to current date)
  • first_date - the first valid date. Defaults to fifty years before the last_date.
# An example using guess_dates on the variable dtdeath
data_cleaned <- data %>% 
  mutate(dtdeath = linelist::guess_dates(dtdeath, error_tolerance = 0.1, first_date = "2016-01-01")

Excel Dates

Excel stores dates as the number of days since December 30, 1899. If the dataset you imported from Excel shows dates as numbers or characters like “41369”… use the as.Date() function to convert, but instead of supplying a format as above, supply an origin date.

NOTE: You should provide the origin date in R’s default date format ("YYYY-MM-DD").

# An example of providing the Excel 'origin date' when converting Excel number dates
data_cleaned <- data %>% 
  mutate(date_of_onset = as.Date(date_of_onset, origin = "1899-12-30"))

How dates are displayed

Once dates are the correct class, you often want them to display differently (e.g. in a plot, graph, or table). For example, to display as “Monday 05 Jan” instead of 2018-01-05. You can do this with the function format(), which works in a similar way as as.Date(). Read more in this online tutorial

%d = Day # (of the month e.g. 16, 17, 18…) %a = abbreviated weekday (Mon, Tues, Wed, etc.)
%A = full weekday (Monday, Tuesday, etc.)
%m = # of month (e.g. 01, 02, 03, 04)
%b = abbreviated month (Jan, Feb, etc.)
%B = Full Month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%h = hours (24-hr clock)
%m = minutes
%s = seconds %z = offset from GMT
%Z = Time zone (character)

An example of formatting today’s date:

# today's date, with formatting
format(Sys.Date(), format="%d %B %Y")
## [1] "27 December 2020"

# easy way to get full date and time (no formatting)
date()
## [1] "Sun Dec 27 15:30:19 2020"

# formatted date, time, and time zone (using paste0() function)
paste0(format(Sys.Date(), format= "%A, %b %d '%y, %z  %Z, "), format(Sys.time(), format = "%H:%M:%S"))
## [1] "Sunday, Dec 27 '20, +0000  UTC, 15:30:19"

Calculating distance between dates

The difference between dates can be calculated by:

  1. Correctly formating both date variable as class date (see instructions above)
  2. Creating a new variable that is defined as one date variable subtracted from the other
  3. Converting the result to numeric class (default is class “datediff”). This ensures that subsequent mathematical calculations can be performed.

Converting dates/time zones

TODO

Epidemiological weeks

The templates use the very flexible package aweek to set epidemiological weeks. You can read more about it on the RECON website

Dates in Epicurves

See the section on epicurves.

Dates miscellaneous

  • Sys.Date( ) returns the current date of your computer
  • Sys.Time() returns the current time of your computer
  • date() returns the current date and time.

Cleaning data

Overview

This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.

HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).

The cleaning steps demonstrated include:

  • Loading the data
  • column name cleaning
  • column selection
  • Designating column classes
  • Filtering rows
  • Re-coding values
  • Creating groups (case_when())
  • Dealing with character case (upper, lower, title, etc.)
  • Factor columns

replace missing with dealing with cases (all lower, etc) case_when() factors

Preparation

Load packages

pacman::p_load(tidyverse,  # data manipulation and visualization
               janitor,    # data cleaning
               epitrix     # data cleaning
               )

Load data

Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)

linelist_raw <- import("ebola_simulated.xlsx")

You can view the first 50 rows of the the original “raw” dataset below:

# display the linelist data as a table
DT::datatable(head(linelist_raw,50), rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Cleaning pipeline

In epidemiological analysis and data processing, cleaning steps are often performed together and sequentially. In R this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another. The chain often utilizes dplyr verbs and the magrittr pipe operator (see handbook page on dplyr and tidyverse coding style (LINK HERE).

In a cleaning pipeline the order of the steps is important. Cleaning steps may include:

  • Column names may be cleaned or changed
  • Rows may be filtered or added
  • Columns may be selected, added, transformed, or re-ordered
  • Values may be re-coded, cleaned, or grouped

Column names

Column names are used very often so they need to have “clean” syntax. We suggest the following:

  • Short names
  • No spaces (replaced with underscores (_),
  • No unusual characters (&, #…)
  • Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…)

The columns names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).

Also note that in the raw data, the two final columns names were two merged cells with one name. The import() function used the name for the first of the two columns, and assigned the second column the name “…23” as it was then empty (referring to the 23rd column).

names(linelist_raw)
##  [1] "row_num"         "case_id"         "generation"      "infection date" 
##  [5] "date onset"      "hosp date"       "date_of_outcome" "outcome"        
##  [9] "gender"          "hospital"        "lon"             "lat"            
## [13] "infector"        "source"          "age"             "age_unit"       
## [17] "fever"           "chills"          "cough"           "aches"          
## [21] "vomit"           "merged_header"   "...23"
Note: For a column name that include spaces, surround the name with back-ticks, for example: linelist$`infection date`. On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).

Automatic syntax cleaning

The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

  • Converts all names to consist of only underscores, numbers, and letters
  • Accented characters are transliterated to ASCII (e.g. german o with umlaut becomes “o”, spanish “enye” becomes “n”)
  • Capitalization preference can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
  • You can designate specific name replacements with the replace = argument (e.g. replace = c(onset = “date_of_onset”))
  • Here is an online vignette

Below, the cleaning pipeline begins by using clean_names() on the raw linelist.

# send the dataset through the function clean_names()
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new names
names(linelist)
##  [1] "row_num"         "case_id"         "generation"      "infection_date" 
##  [5] "date_onset"      "hosp_date"       "date_of_outcome" "outcome"        
##  [9] "gender"          "hospital"        "lon"             "lat"            
## [13] "infector"        "source"          "age"             "age_unit"       
## [17] "fever"           "chills"          "cough"           "aches"          
## [21] "vomit"           "merged_header"   "x23"

NOTE: The column name “…23” was changed to “x23”.

Manual column name cleaning

Re-naming columns manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new column name is given before the old column name.

Below, a re-name command is added to the cleaning pipeline:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the columns names have been changed:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

You can also rename by column position, instead of column name, for example:

rename(newNameForFirstColumn = 1,
       newNameForSecondColumn = 2)

Empty Excel column names

If you importing an Excel sheet with a missing column name, depending on the import function used, R will likely create a column name with a value like “…1” or “…2”. You can clean these names manually by referencing their position number (see above), or their name (linelist_raw$...1).

Merged Excel column names

Merged cells in an Excel file are a common occurrance when receiving data from field level. Merged cells can be nice for human reading of data, but cause many problems for machine reading of data. R cannot accomodate merged cells. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.

When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.

One solution to deal with merged cells is to import the data with the function readWorkbook() from package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.

linelist_raw <- openxlsx::readWorkbook(here("data", "ebola_simulated.xlsx"), fillMergedCells = TRUE)

DANGER: If column names are merged, you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning..

Skip import of column names

Sometimes, you may want to avoid importing a row of data. Using import() from the rio package on a .xlsx file, you can do this with the argument skip =. Provide the number of rows you want to skip.

linelist_raw <- import("ebola_simulated.xlsx", skip = 1)  # does not import header row

Second header row

You may need to avoid importing the second row of data, for example if it is a data dictionary row (as in the example linelist). This can be problematic because it can result in all columns being imported as class “character”. To solve this, you will likely need to import the data twice.

  1. Import the data in order to store the correct column names
  2. Import the data again, skipping the first two rows (header and second rows)
  3. Bind the correct names onto the reduced dataframe

The exact arguments used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). If using rio’s import() function, understand which function rio uses to import your data, and then give the appropriate argument to skip lines and/or designate the column names. See the handbook page on importing data (LINK) for details on rio.

For Excel files:

# REMOVE 2nd ROW (DATA DICTIONARY)
linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)
# For excel files
linelist_raw_names <- import("ebola_simulated.xlsx") %>% names()
linelist_raw <- import("ebola_simulated.xlsx", skip = 2, col_names = linelist_raw_names) # argument is 'col_names'

For CSV files:

# For csv files
linelist_raw_names <- import("ebola_simulated.csv") %>% names()
linelist_raw <- import("ebola_simulated.csv", skip = 2, col.names = linelist_raw_names) # argument is 'col.names'

Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names

Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it using the gather() command from the tidyr package.
source: https://alison.rbind.io/post/2018-02-23-read-multiple-header-rows/

TO DO

library(tidyr)
stickers_dict <- import("ebola_simulated.xlsx") %>% 
  clean_names() %>% 
  gather(variable_name, variable_description)
stickers_dict

Combine two header rows

In some cases, you may want to combine two header rows into one. This command will define the column names as the combination (pasting together) of the existing column names with the value underneath in the first row.

names(df) <- paste(names(df), df[1, ], sep = "_")

Select or re-order columns

CAUTION: This tab may follow from previous tabs.

Often the first step of cleaning data is selecting the columns you want to work with, and to set their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.

CAUTION: In the examples below, linelist is modified with select() but not over-written. New column names are only displayed for purpose of example.

Here are all the column names in the linelist:

names(linelist)
##  [1] "row_num"              "case_id"              "generation"          
##  [4] "date_infection"       "date_onset"           "date_hospitalisation"
##  [7] "date_outcome"         "outcome"              "gender"              
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "fever"                "chills"              
## [19] "cough"                "aches"                "vomit"               
## [22] "merged_header"        "x23"

With select() you can do the following:

Select only the columns you want to remain, and their order of appearance

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names() # display the column names
## [1] "case_id"              "date_onset"           "date_hospitalisation"
## [4] "fever"

Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained. Inside select() you can use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom columns
  names()
##  [1] "row_num"              "case_id"              "generation"          
##  [4] "date_infection"       "date_hospitalisation" "date_outcome"        
##  [7] "outcome"              "gender"               "hospital"            
## [10] "lon"                  "lat"                  "infector"            
## [13] "source"               "age"                  "age_unit"            
## [16] "merged_header"        "x23"

Re-order the columns - use everything() to signify all other columns not specified in the select() command:

# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, gender, everything()) %>% 
  names()
##  [1] "case_id"              "date_onset"           "date_hospitalisation"
##  [4] "gender"               "row_num"              "generation"          
##  [7] "date_infection"       "date_outcome"         "outcome"             
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "fever"                "chills"              
## [19] "cough"                "aches"                "vomit"               
## [22] "merged_header"        "x23"

As well as everything() there are several special functions that work within select(), namely:

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
  • ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
  • contains() - columns containing a character string. Example: select(contains("time"))
  • matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))
  • num_range() -
  • any_of() - matches if column is named. Useful if the name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))

Here is an example using where():

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()
## [1] "date_infection"       "date_onset"           "date_hospitalisation"
## [4] "date_outcome"
# searched for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()
## [1] "date_onset"           "date_hospitalisation" "hospital"            
## [4] "fever"

Adding select()to the cleaning pipe chain:

In the linelist, there are a few columns we do not need: row_num, merged_header, and x23. Remove them by adding a select() command to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x23))

select() as a stand-alone command

Select can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.

# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))

# display the column names
names(linelist_age)
## [1] "case_id"  "age"      "age_unit"

Add columns and rows

See the tabs below to add columns and rows

Add columns

mutate()

We advise creating new columns with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use mutate() or the base R style to create a new column (see below).

The verb mutate() is used to add a new column, or to modify an existing one. Below are some example of creating new columns with mutate(). The syntax is: new_column_name = value or function. It is best practice to separate each new column with a comma and new line.

linelist <- linelist %>%                       # creating new, or modifying old dataset
  mutate(new_var_dup    = case_id,             # new column = duplicate/copy another column
         new_var_static = 7,                   # new column = all values the same
         new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
         new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
         ) 

Scroll to the right to see the new columns:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

TIP: The verb transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.

New columns using base R

To define a new column (or re-define a column) using base R, just use the assignment operator as below. Remember that when using base R you must specify the dataframe before writing the column name (e.g. dataframe$column). Here are two dummy examples:

linelist$old_var <- linelist$old_var + 7
linelist$new_var <- linelist$old_var + linelist$age

Add rows

TO DO

Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.

linelist <- linelist %>% 
  add_row(row_num = 666, case_id = "abc", generation = 4, `infection date` = as.Date("2020-10-10"), .before = 2)

use .before and .after. .before = 3 will put it before the 3rd row. Default is to add it to the end. columns not specified will be let empty. The new row number may look strange (“…23”) but the row numbers have changed. So if using the command twice examine/test carefully.

If your class is off you will see an error like this: Error: Can’t combine ..1$infection date and ..2$infection date . (for a date value remember to wrap the date in the functionas.Date() like as.Date("2020-10-10"))

New columns using grouped values

CAUTION: This tab may follow from previous tabs.

Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html

Taken from website above:

#Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as #soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:

starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:

starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.

Fix classes

CAUTION: This tab may follow from previous tabs.

See section on object classes

Often you will need to set the correct class for a column. The most common approach is to use mutate() to define the column as itself, but with a different class.

First we run some checks on the classes of important columns.

The class of the “age” column is character. To perform analysis, we need those numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the “date_onset” column is also character! To perform analysis, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

However, if we try to classify this column as date, we would get an error. Use table() or sort or another method to examine all the values and identify different one. For example in our dataset we see that we see that one date_onset value was entered in a different format (15th April 2014) than all the other values!

## 
## 15th April 2014      2012-05-27      2012-06-06      2012-06-15      2012-06-17 
##               1               1               1               1               3 
##      2012-06-20 
##               1

Before we can classify “date_onset” as a date, this value must be fixed to be the same format as the others. You can fix the date in the source data, or, we can do in the cleaning pipeline via mutate() and recode(). This must be done before the commands to convert to class Date. (LINK TO DATE SECTION).

The mutate() line can be read as: “mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this for recoding.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x23)) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################
  # fix incorrect values                 # old value       # new value
  mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
  
  # correct the class of the columns
  mutate(across(contains("date"), as.Date), 
         generation = as.numeric(generation),
         age        = as.numeric(age))
         
         

Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! For as.Date(), the format = argument is often a source of errors.

Fix class for multiple columns at once

class(linelist$date_infection)
## [1] "Date"
head(linelist$date_infection)
## [1] "2014-04-09" NA           NA           "2014-05-07" NA          
## [6] "2014-05-06"

You can use The dplyr function across() with mutate() to convert several columns at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the columns where is.POSIXct() (a type of date/time class that shows unnecessary timestamps) is TRUE, and apply the function is.Date() to them, in order to convert them to class “date”.

  • Note that within across() we also use the function where().
  • Note that is.POSIXct is from the package lubridate. Other similar functions (is.character(), is.numeric(), and is.logical()) are from base R
  • Note that the functions in across() are written without the empty parentheses ()
linelist <- linelist %>% 
  mutate(across(where(lubridate::is.POSIXct), as.Date))

Recoding values

blah blah blah TO DO

Manual recoding

Manual using mutate()

mutate() is also used to recode the values in a column. For example, in linelist the values in the column “hospital” must be cleaned. There are several different spellings (often the word “Hospital” is missing an “s” and is written “Hopital”), and many missing values.

table(linelist$hospital, useNA = "always")
## 
##                      Central Hopital                     Central Hospital 
##                                   12                                  446 
##                           Hospital A                           Hospital B 
##                                   54                                   54 
##                     Military Hopital                    Military Hospital 
##                                   31                                  805 
##                     Mitylira Hopital                    Mitylira Hospital 
##                                    1                                   79 
##                                Other                         Port Hopital 
##                                  906                                   47 
##                        Port Hospital St. Mark's Maternity Hospital (SMMH) 
##                                 1767                                  414 
##   St. Marks Maternity Hopital (SMMH)                                 <NA> 
##                                   12                                 1499

Manual using recode()

To change spellings manually, one-by-one, you can use the recode() function within the mutate function. The code is saying that the column “hospital” should be defined as the current column “hospital”, but with certain changes (the syntax is OLD = NEW). Don’t forget commas!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      ))

Now we see the values in the hospital column have been corrected:

table(linelist$hospital, useNA = "always")
## 
##                     Central Hospital                           Hospital A 
##                                  458                                   54 
##                           Hospital B                    Military Hospital 
##                                   54                                  916 
##                                Other                        Port Hospital 
##                                  906                                 1814 
## St. Mark's Maternity Hospital (SMMH)                                 <NA> 
##                                  426                                 1499

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space inbetween ("").

Manual using base R

If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new column by assigning it a value. In the command below, the column new_var does not exist until after the command is executed. In this simple example the column is assigned the static value “new value”, so for all rows the value will be “new value”.

linelist_raw$new_var <- "new value"

If necessary, you make manual changes to a specific value in a dataframe by referencing the row number of case ID. But remember it is better if you can make these changes permanently in the underlying data!

Here is a fake example. It reads as “Change the value of the dataframe linelist‘s column onset_date (for the row where linelist’s column case_id has the value ’9d4019’) to as.Date("2020-10-24")”.

linelist$date_onset[linelist$case_id == "9d4019"] <- as.Date("2020-10-24")

Recoding by logic/condition

Logic using case_when()

If you need to use logic statements to recode values, or want to use operators like %in%, use dplyr’s case_when() instead. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic order!

Note that all Right-hand side (RHS) inputs must be of the same class (e.g. character, numeric, logical). Notice the use of the special value NA_real_ instead of just NA.

linelist <- linelist %>% 
  mutate(age_years = case_when(
            age_unit == "years"  ~ age,     # if age is given in years
            age_unit == "months" ~ age/12,  # if age is given in months
            is.na(age_unit)      ~ age,     # if age unit is missing, assume years
            TRUE                 ~ NA_real_)) # Any other circumstance

Logic using ifelse() and if_else()

For simple uses of logical re-coding or new variable creationgyou can use ifelse() or if_else(). Though in most cases it is better to use case_when().

These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if condition evaluates to TRUE, value if condition evaluates to FALSE). If used in a mutate(), each row is evaluated. if_else() is a special version from dplyr that handles dates in the condition.

It can be tempting to string together many ifelse commands… resist this and use case_when() instead! It is much more simple, easier to read, and easier to identify errors.

IMAGE of ifelse string with X across is.

You can reference other columns with the ifelse() function within mutate():

Example of ifelse():

linelist <- linelist %>% 
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

Example of if_else() (using dates): Note that if the ‘true’ value is a date, the ‘false’ value must also qualify a date, hence using the special character NA_real_ instead of just NA.

linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

Note: If you want to alternate a value used in your code based on other circumstances, consider using switch() from base R. For example if… TO DO. See the section on using switch() in the page on R interactive console.

Recoding using special dplyr functions

Using replace_na()

To change missing values (NA) to a specific character value, such as “Missing”, use the function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

Using na_if()

Likewise you can quickly convert a specific character value to NA using na_if(). The command below is the opposite of the one above. It converts any values of “Missing” to NA.

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

Using coalesce()

This dplyr function finds the first non-missing value at each position. So, you provide it with columns and for each row it will fill the value with the first non-missing value in the columns you provided.

For example, you might use thiscoalesce()` create a “location” variable from hypothetical variables “patient_residence” and “reporting_jurisdiction”, where you prioritize patient residence information, if it exists.

linelist <- linelist %>% 
  mutate(location = coalesce(patient_residence, reporting_jurisdiction))

TO DO lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),

Recoding using cleaning dictionaries

CAUTION: This tab may follow from previous tabs.

## load cleaning rules and only keep columns in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
  filter(column %in% c(names(mll_raw), ".global"))

## define columns that are not cleand
unchanged <- c(
  "epilink_relationship",
  "narratives",
  "epilink_relationship_detail"
)

mll_clean <- mll_raw %>%
  ## convert to tibble
  as_tibble() %>%
  ## clean columns using cleaning rules
  clean_data(
    wordlists = mll_cleaning_rules,
    protect = names(.) %in% unchanged
  )

Add recoding to the pipe chain

Here we add the described cleaning steps to the pipe chain.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-c(row_num, merged_header, x23)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_))

Filter rows

CAUTION: This tab may follow from previous tabs.

A typical early cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept.

The tabs below show how to filter rows based on simple and complex logical conditions, and how to filter/subset rows as a stand-alone command and with base R

A simple filter()

A simple example applies a filter command within a pipe chain. The command re-defines the dataframe linelist as itself having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses is TRUE are kept.
In this case, the logical statement is !is.na(case_id), which is asking whether the value in the column case_id is not missing (NA). Thus, rows where case_id is not missing are kept.

Before the filter is applied, the number of rows in linelist is 6127.

linelist <- linelist %>% 
  filter(!is.na(case_id))  # keep only rows where case_id is not missing

After the filter is applied, the number of rows in linelist is 6122.

A complex filter()

A more complex example using filter():

Examine the data

Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.

hist(linelist$date_onset, breaks = 50)

#### Be aware how filters handle missing numeric and date values

Can we just filter by date_onset to rows after June 2013? Caution! Applying filter(date_onset > as.Date("2013-06-01"))) would accidentally remove any rows in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date or number can remove any rows with missing values (NA)! This is because NA is treated as infinitely large and small.

Design the filter

Examine a cross-tabulation to make sure we exclude only the correct rows:

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2012 2013 2014 2015 <NA>
##   Central Hospital                        0    0  345   93   20
##   Hospital A                             41   12    0    0    0
##   Hospital B                             39   13    0    0    2
##   Military Hospital                       0    0  672  201   43
##   Missing                                 0    0 1122  309   64
##   Other                                   0    0  696  168   42
##   Port Hospital                           9    1 1380  347   77
##   St. Mark's Maternity Hospital (SMMH)    0    0  325   88   13
##   <NA>                                    0    0    0    0    0

What other criteria can we filter on to remove the first outbreak from the dataset? We see that:

  • The first epidemic occurred at Hospital A, Hospital B, and that there were also 10 cases at Port Hospital.
  • Hospitals A & B did not have cases in the second epidemic, but Port Hospital did.

We want to exclude:

  • The 117 rows with onset in 2012 and 2013 at hospitals A, B, and Port, including:
    • Excluding the 2 rows from Hospitals A & B with missing onset dates
    • Do not exclude any other rows with missing onset dates.

We start with a linelist of nrow(linelist). Here is our filter statement:

linelist <- linelist %>% 
  # keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 6005

When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, and the 10 Port Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2014 2015 <NA>
##   Central Hospital                      345   93   20
##   Military Hospital                     672  201   43
##   Missing                              1122  309   64
##   Other                                 696  168   42
##   Port Hospital                        1380  347   77
##   St. Mark's Maternity Hospital (SMMH)  325   88   13
##   <NA>                                    0    0    0

Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity.

Note: some readers may notice that it would be easier to just filter by date_hospitalisation because it is 100% complete. This is true. But for pdate_onset is used for purposes of a complex filter example.

Filter as a stand-alone command

Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.

# dataframe <- filter(dataframe, condition(s) for rows to keep)

linelist <- filter(linelist, !is.na(case_id))

You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.

# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

linelist <- linelist[!is.na(case_id), ]

TIP: Use bracket-subset syntax with View() to quickly review a few records.

Filtering to quickly review data

This base R syntax can be handy when you want to quickly view a subset of rows and columns. Use the base R View() command (note the capital “V”) around the [] subset you want to see. The result will appear as a dataframe in your RStudio viewer panel. For example, if I want to review onset and hospitalization dates of 3 specific cases:

View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

Note: the above command can also be written with dplyr verbs filter() and select() as below:

View(linelist %>% filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>% select(date_onset, date_hospitalisation)

Add the filters to the cleaning pipe chain

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
        select(-c(row_num, merged_header, x23)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years"  ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit)      ~ age,
          TRUE                 ~ NA_real_)) %>% 
    
  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

Numeric categories

CAUTION: This tab may follow from previous tabs.

Special approaches for creating numeric categories

Common examples include age categories, groups of lab values, etc.

There are several ways to create categories of a numeric column such as age. Here we will discuss:

  • age_categories(), from the epikit package
  • cut(), from base R
  • using percentiles to break your numbers
  • natural break points… ? TO DO
  • case_when()

Sometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values, for example an entry of “2 months” for age, or (depending on your R locale settings) if a comma is used in the decimals place (e.g. “4,5” to mean four and one half years).

For this example we will create an age_cat column using the age_years column.

#check the class of the linelist variable age
class(linelist$age_years)
## [1] "numeric"

Numeric categories using age_categories() from epikit

With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this can be applied to non-age numeric variables too). The output is an ordered factor.

The break values specified are included in the higher group, that is groups are open on the lower/left side. As shown below, you can add 1 to each break value to achieve groups that are open at the top/right.

Other optional arguments:

  • lower = Default is 0). The lowest number you want considered.
  • upper = The highest number you want considered.
  • by = The number of years between groups.
  • separator = Default is “-”. Character between ages in labels.
  • ceiling = Default FALSE. If TRUE, the highest break value is a ceiling and a category “XX+” is not included. Any values above highest break or upper (if defined) are categorized as NA.

See the function’s Help page for more details (enter ?age_categories in the R console).

library(epikit)

# Simple example
################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years,
                                  breakers = c(0, 5, 10, 15, 20, 30, 50, 70)))
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1172  1130   995   895  1090   605    26     0    92


# With ceiling set to TRUE
##########################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years, 
                                  breakers = c(0, 5, 10, 15, 20, 30, 50, 70),
                                  upper = max(linelist$age_years, na.rm=T),
                                  ceiling = TRUE)) # 70 is the ceiling
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-70  <NA> 
##  1172  1130   995   895  1090   605    26    92


# Include upper ends for the same categories
############################################
linelist <- linelist %>% 
  mutate(age_cat = age_categories(age_years, 
                                  upper = max(linelist$age_years, na.rm=T),
                                  breakers = c(0, 6, 11, 16, 21, 31, 51, 71, 76)))
# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-5  6-10 11-15 16-20 21-30 31-50 51-70 71-75   76+  <NA> 
##  1413  1098   987   851   996   550    18     0     0    92

Numeric categories using cut()

You can use the base R function cut(), which creates categories from a numeric variable. The differences from age_categories() are:

  • You do not need to install/load another package
  • You can specify whether groups are open/closed on the right/left
  • You must provide labels yourself (and ensure they are accurate to the groups)
  • If you want 0 included in the lowest group you must specify this

The basic syntax within cut() is to first provide the numeric variable to be cut (age_years), and then the breaks argument, which is a numeric vector (c()) of break points. Using cut(), the resulting column is an ordered factor.

If used within mutate() (a dplyr verb) it is not necessary to specify the dataframe before the column name (e.g. linelist$age_years).

cut() example

Create new column of age categories (age_cat) by cutting the numeric age_year column at specified break points. The example below replicates the first age_categories() example.

  • Specify numeric vector of break points c(0, 5, 10, 15, ...)
  • Default behavior for cut() is that lower break values are excluded from each category, and upper break values are included. This is the opposite behavior from the age_categories() function.
  • Include 0 in the lowest category by adding include.lowest = TRUE
  • Add a vector of customized labels using the labels = argument
  • Check your work with cross-tabulation of the numeric and category columns - be aware of missing values
linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,                                       # numeric column
                        breaks = c(0, 5, 10, 15, 20, 30, 50, 70,        # break points...
                                   max(linelist$age_years, na.rm=T)),   # ... with dynamic last break as column max value
                        right = TRUE,                                   # lower breaks included and upper excluded [a,b)
                        include.lowest = TRUE,                          # 0 included in lowest category
                        labels = c("0-4", "5-9", "10-14", "15-19",      # manual labels - be careful!
                                   "20-29", "30-49", "50-69", "70+")))       

table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1413  1098   987   851   996   550    18     0    92

cut() in detail

Below is a detailed description of the behavior of using cut() to make the age_cat column. Key points:

  • Inclusion/exclusion behavior of break points
  • Custom category labels
  • Handling missing values
  • Check your work!

The most simple command of cut() applied to age_years to make the new variable age_cat is below:

# Create new variable, by cutting the numeric age variable
# by default, upper break is excluded and lower break excluded from each category
linelist <- linelist %>% 
  mutate(age_cat = cut(age_years, breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100)))

# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
## 
##    (0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100] 
##     1285     1098      987      851      996      550       18        0 
##     <NA> 
##      220
  • By default, the categorization occurs so that the right/upper side is “open” and inclusive (and the left/lower side is “closed” or exclusive). The default labels use the notation “(A, B]”, which means the group does not include A (the lower break value), but includes B (the upper break value). Reverse this behavior by providing the right = TRUE argument.

  • Thus, by default “0” values are excluded from the lowest group, and categorized as NA. “0” values could be infants coded as age 0. To change this add the argument include.lowest = TRUE. Then, any “0” values are included in the lowest group. The automatically-generated label for the lowest category will change from “(0,B]” to “[0,B]”, which signifies that 0 values are included.

  • Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 15-20).

# Cross tabulation of the numeric and category columns. 
table("Numeric Values" = linelist$age_years,   # names specified in table for clarity.
      "Categories"     = linelist$age_cat,
      useNA = "always")                        # don't forget to examine NA values
##                     Categories
## Numeric Values       (0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70]
##   0                      0      0       0       0       0       0       0
##   0.0833333333333333     1      0       0       0       0       0       0
##   0.166666666666667      1      0       0       0       0       0       0
##   0.25                   1      0       0       0       0       0       0
##   0.333333333333333      1      0       0       0       0       0       0
##   0.416666666666667      1      0       0       0       0       0       0
##   0.5                    3      0       0       0       0       0       0
##   0.583333333333333      3      0       0       0       0       0       0
##   0.666666666666667      2      0       0       0       0       0       0
##   0.75                   3      0       0       0       0       0       0
##   0.833333333333333      4      0       0       0       0       0       0
##   0.916666666666667      3      0       0       0       0       0       0
##   1                    276      0       0       0       0       0       0
##   1.5                    1      0       0       0       0       0       0
##   2                    244      0       0       0       0       0       0
##   3                    259      0       0       0       0       0       0
##   4                    241      0       0       0       0       0       0
##   5                    241      0       0       0       0       0       0
##   6                      0    230       0       0       0       0       0
##   7                      0    243       0       0       0       0       0
##   8                      0    198       0       0       0       0       0
##   9                      0    218       0       0       0       0       0
##   10                     0    209       0       0       0       0       0
##   11                     0      0     185       0       0       0       0
##   12                     0      0     208       0       0       0       0
##   13                     0      0     187       0       0       0       0
##   14                     0      0     206       0       0       0       0
##   15                     0      0     201       0       0       0       0
##   16                     0      0       0     183       0       0       0
##   17                     0      0       0     173       0       0       0
##   18                     0      0       0     168       0       0       0
##   19                     0      0       0     170       0       0       0
##   20                     0      0       0     157       0       0       0
##   21                     0      0       0       0     130       0       0
##   22                     0      0       0       0     134       0       0
##   23                     0      0       0       0     124       0       0
##   24                     0      0       0       0     104       0       0
##   25                     0      0       0       0     102       0       0
##   26                     0      0       0       0      93       0       0
##   27                     0      0       0       0     104       0       0
##   28                     0      0       0       0      77       0       0
##   29                     0      0       0       0      65       0       0
##   30                     0      0       0       0      63       0       0
##   31                     0      0       0       0       0      61       0
##   32                     0      0       0       0       0      67       0
##   33                     0      0       0       0       0      59       0
##   34                     0      0       0       0       0      44       0
##   35                     0      0       0       0       0      45       0
##   36                     0      0       0       0       0      46       0
##   37                     0      0       0       0       0      31       0
##   38                     0      0       0       0       0      33       0
##   39                     0      0       0       0       0      24       0
##   40                     0      0       0       0       0      20       0
##   41                     0      0       0       0       0      23       0
##   42                     0      0       0       0       0       9       0
##   43                     0      0       0       0       0      22       0
##   44                     0      0       0       0       0      12       0
##   45                     0      0       0       0       0      10       0
##   46                     0      0       0       0       0       6       0
##   47                     0      0       0       0       0      11       0
##   48                     0      0       0       0       0      13       0
##   49                     0      0       0       0       0       6       0
##   50                     0      0       0       0       0       8       0
##   51                     0      0       0       0       0       0       2
##   52                     0      0       0       0       0       0       5
##   53                     0      0       0       0       0       0       5
##   54                     0      0       0       0       0       0       1
##   56                     0      0       0       0       0       0       2
##   57                     0      0       0       0       0       0       1
##   59                     0      0       0       0       0       0       1
##   60                     0      0       0       0       0       0       1
##   <NA>                   0      0       0       0       0       0       0
##                     Categories
## Numeric Values       (70,100] <NA>
##   0                         0  128
##   0.0833333333333333        0    0
##   0.166666666666667         0    0
##   0.25                      0    0
##   0.333333333333333         0    0
##   0.416666666666667         0    0
##   0.5                       0    0
##   0.583333333333333         0    0
##   0.666666666666667         0    0
##   0.75                      0    0
##   0.833333333333333         0    0
##   0.916666666666667         0    0
##   1                         0    0
##   1.5                       0    0
##   2                         0    0
##   3                         0    0
##   4                         0    0
##   5                         0    0
##   6                         0    0
##   7                         0    0
##   8                         0    0
##   9                         0    0
##   10                        0    0
##   11                        0    0
##   12                        0    0
##   13                        0    0
##   14                        0    0
##   15                        0    0
##   16                        0    0
##   17                        0    0
##   18                        0    0
##   19                        0    0
##   20                        0    0
##   21                        0    0
##   22                        0    0
##   23                        0    0
##   24                        0    0
##   25                        0    0
##   26                        0    0
##   27                        0    0
##   28                        0    0
##   29                        0    0
##   30                        0    0
##   31                        0    0
##   32                        0    0
##   33                        0    0
##   34                        0    0
##   35                        0    0
##   36                        0    0
##   37                        0    0
##   38                        0    0
##   39                        0    0
##   40                        0    0
##   41                        0    0
##   42                        0    0
##   43                        0    0
##   44                        0    0
##   45                        0    0
##   46                        0    0
##   47                        0    0
##   48                        0    0
##   49                        0    0
##   50                        0    0
##   51                        0    0
##   52                        0    0
##   53                        0    0
##   54                        0    0
##   56                        0    0
##   57                        0    0
##   59                        0    0
##   60                        0    0
##   <NA>                      0   92

Read more about cut() in its Help page by entering ?cut in the R console.

Reversing break inclusion behavior

Lower break values will be included in each category (and upper break values excluded) if the argument right = is included and and set to TRUE. This is applied below - note how the values have shifted among the categories.

NOTE: If you include the include.lowest = TRUE argument and right = TRUE, the include.lowest will now apply to the highest break point value and category, not the lowest.

linelist <- linelist %>% 
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),     # same breaks
                          right = FALSE,                                     # include each *lower* break point            
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")))  # now the labels must change

table(linelist$age_cat, useNA = "always")
## 
##    0-4    5-9  10-14  15-19  20-29  30-49  50-69 70-100   <NA> 
##   1172   1130    995    895   1090    605     26      0     92

Re-labeling NA values with cut()

Because cut() does not automatically label NA values, you may want to assign a label such as “Missing”. This requires a few extra steps because cut() automatically classified the new column age_cat as a Factor (a rigid column class with specific value labels).

First, convert age_cut from Factor to Character class, so you have flexibility to add new character values (e.g. “Missing”). Otherwise you will encounter an error. Then, use the dplyr verb replace_na() to replace NA values with a character value like “Missing”. These steps can be combined into one step, as shown below.

Note that Missing has been added, but the order of the categories is now wrong (alphabetical).

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,                                                      
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"))


table(linelist$age_cat, useNA = "always")
## 
##     0-4   10-14   15-19   20-29   30-49     5-9   50-69 Missing    <NA> 
##    1172     995     895    1090     605    1130      26      92       0

To fix this, re-convert age_cat to a factor, and define the order of the levels correctly.

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(age_years,
                          breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
                          right = FALSE,                                                      
                          labels = c("0-4", "5-9", "10-14", "15-19",
                                     "20-29", "30-49", "50-69", "70-100")),
         
         # convert to class Character, and replace NA with "Missing"
         age_cat = replace_na(as.character(age_cat), "Missing"),
         
         # re-classify age_cat as Factor, with correct level order and new "Missing" level
         age_cat = factor(age_cat, levels = c("0-4", "5-9", "10-14", "15-19", "20-29",
                                              "30-49", "50-69", "70-100", "Missing")))    
  

table(linelist$age_cat, useNA = "always")
## 
##     0-4     5-9   10-14   15-19   20-29   30-49   50-69  70-100 Missing    <NA> 
##    1172    1130     995     895    1090     605      26       0      92       0

If you want a fast way to make breaks and labels, you can use something like below (adjust to your specific situation). See the page on using seq() and rep() and c() TO DO

# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
age_seq

# Make labels for the above categories, assuming default cut() settings
age_labels = paste0(age_seq+1, "-", age_seq + 5)
age_labels

# check that both vectors are the same length
length(age_seq) == length(age_labels)

# # Use them in the cut() command
# cut(linelist$age, breaks = age_seq, labels = age_labels)

Numeric categories using case_when()

The dplyr function case_when() can also be used to create numeric categories.

  • Allows explicit setting of break point inclusion/exclusion
  • Allows designation of label for NA values in one step
  • More complicated code, arguably more prone to error
  • Allow more flexibility to include other variables in the logic

If using case_when() please review the in-depth page on it, as the logic and order of assignment are important understand to avoid errors.

CAUTION: In case_when() all right-hand side values must be of the same class. Thus, if your categories are character values (e.g. “20-30 years”) then any designated outcome for NA age values must also be character (“Missing”, or the special NA_character_ instead of NA).

You will need to designate the column as a factor (by wrapping case_when() in the function factor()) and provide the ordering of the factor levels using the levels = argument after the close of the case_when() function. When using cut(), the factor and ordering of levels is done automatically.

linelist <- linelist %>% 
  mutate(age_cat = factor(case_when(
          # provide the case_when logic and outcomes
          age_years >= 0 & age_years < 5     ~ "0-4",          # logic by age_year value
          age_years >= 5 & age_years < 10    ~ "5-9",
          age_years >= 10 & age_years < 15   ~ "10-14",
          age_years >= 15 & age_years < 20   ~ "15-19",
          age_years >= 20 & age_years < 30   ~ "20-29",
          age_years >= 30 & age_years < 50   ~ "30-49",
          age_years >= 50 & age_years < 70   ~ "50-69",
          age_years >= 45 & age_years <= 100 ~ "70-100",
          is.na(age_years)                   ~ "Missing",  # if age_years is missing
          TRUE                               ~ "Check value"   # catch-all alarm to trigger review
          ), levels = c("0-4","5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100", "Missing", "Check value"))
         )


table(linelist$age_cat, useNA = "always")
## 
##         0-4         5-9       10-14       15-19       20-29       30-49 
##        1172        1130         995         895        1090         605 
##       50-69      70-100     Missing Check value        <NA> 
##          26           0          92           0           0

Add numeric categories to pipe chain

Below, code to create two categorical age columns is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# remove 2nd row, which contains data dictionary values
#######################################################
    # store column names
    linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
    
    # import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
    linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
        select(-c(row_num, merged_header, x23)) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 

    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
    
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################   
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5)))
        

Highest in hierarchy

CAUTION: This tab may follow from previous tabs.

Within a group, indicate/convert to the highest value in the group

Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)

rowwise() dplyr()

https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html


linelist <- linelist %>%
  rowwise() %>%
  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes"))

Transforming multiple variables at once

CAUTION: This tab may follow from previous tabs.

across dplyr

A transformation can be applied to multiple variables at once using the across() function from the package dplyr (contained within tidyverse package).

across() can be used with any dplyr verb, but commonly with as mutate(), filter(), or summarise(). Here are some examples to get started.

across() with mutate():

Change all columns to character class

#to change all columns to character class
linelist <- linelist %>% 
  mutate(across(everything(), as.character))

Change only numeric columns

Here are a few online resources on using across(): Hadley Wickham’s thoughts/rationale

Missing Data

Overview

dealing with missing data percent missing over time etc.

Percent missing over time

Or change in percent of anything (X) over time, really.

lines <- linelist %>%
  mutate(date_of_onset = as.Date(date_of_onset, format = "%d/%m/%Y"),
         week = aweek::week2date(aweek::date2week(date_of_onset))) %>% 
  group_by(week) %>% 
  summarize(n_obs = n(),
            dt_hosp_missing = sum(date_of_hospitalisation == "" | is.na(date_of_hospitalisation)),
            dt_hosp_p_miss = dt_hosp_missing / n_obs,
            
            outcome_missing = sum(outcome == "" | is.na(outcome)),
            outcome_p_miss = outcome_missing / n_obs) %>%
  reshape2::melt(id.vars = c("week")) %>%
  filter(grepl("_p_", variable)) %>% 
  
  ggplot()+
    geom_line(aes(x = week, y = value, group = variable, color = variable), size = 1, stat = "identity")+
    labs(title = "Missingness in variables, as proportion of ",
         #subtitle = str_glue("As of {format(report_date, '%d %b')}"), 
         x = "Week",
         y = "Proportion missing",
         fill = "CalREDIE Variable") + 
    scale_color_discrete(name = "Variable", labels = c("Date of Hospitalization Missing", "Outcome Missing"))+
    scale_y_continuous(breaks = c(seq(0,1,0.1)))
    #theme_cowplot()#+
    #theme(legend.position = element_text("none"))

lines

Pivoting

Overview

(pivoting/melting etc.) Transforming datasets from wide-to-long, or long-to-wide…

Wide-to-long

Transforming a dataset from wide to long

Data

We start with data that is in a wide format, e.g. our linelist.

DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

pivot_longer()

#tidyr::pivot_longer(linelist, dplyr::vars(-age, -date_of_hospitalisation), names_to = "variable", values_to = "value" )

Result

Long-to-wide

dplyr pivot_wider()

Data

Pivot wider

Result

Grouping/aggregating data

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Tidyverse - grouping by values

.drop=F in group_by() command

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

group_by()

aggregate()

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Deduplicating

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

  1. identifying and eliminating duplicate records
  2. “Rolling up” or combining multiple records into one

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Matching & joining datasets

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Joins

antijoins as well

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Probabalistic matching

rowmatcher other options (finlay?)

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Character manipulation/search (regex)

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

  1. Syntax/regex
  2. Splitting string variables
  3. Searcihg ICD codes chief complaints
  4. classifing occupations (Santa Clara County example)

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Tables

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Manually

From data frame

knitr::kable DT

Summarizing dataframe

From modelresults

For publication

Other

quickly changing the denominator (per 100,000, etc.)

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Descriptive analysis

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.
linelist <- linelist %>% 
  mutate(age = as.numeric(age))

Summary Statistics

Summary of numeric variable

summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   47.25   60.00   56.91   72.00   91.00       2

Mean

Note the argument na.rm=T, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

mean(linelist$age)
## [1] NA
mean(linelist$age, na.rm=TRUE)  # with the na.rm= argument set to TRUE
## [1] 56.91045

Median

Note the argument na.rm=T, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

median(linelist$age)
## [1] NA
median(linelist$age, na.rm=TRUE)  # with the na.rm= argument set to TRUE
## [1] 60

Range

Note the argument na.rm=T, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

range(linelist$age)
## [1] NA NA
range(linelist$age, na.rm=TRUE)  # with the na.rm= argument set to TRUE
## [1]  2 91

Standard Deviation

Note the argument na.rm=T, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

sd(linelist$age)
## [1] NA
sd(linelist$age, na.rm=TRUE)  # with the na.rm= argument set to TRUE
## [1] 20.43124

Percentile

Note the argument na.rm=T, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).

stats::quantile(linelist$age, na.rm=TRUE)  # default %iles
##    0%   25%   50%   75%  100% 
##  2.00 47.25 60.00 72.00 91.00
stats::quantile(linelist$age, probs = c(.05, 0.5, 0.75, 0.98), na.rm=TRUE) # percentiles specified
##    5%   50%   75%   98% 
## 12.90 60.00 72.00 86.34

Frequency/cross-tabs

Frequency table of 1 and 2 categorical variables

table(linelist$province)
## 
##     Anhui   Beijing    Fujian Guangdong     Hebei     Henan     Hunan   Jiangsu 
##         4         3         5         1         1         4         2        28 
##   Jiangxi  Shandong  Shanghai    Taiwan  Zhejiang 
##         6         2        33         1        46

x <- table(linelist$province, linelist$gender)
#janitor::adorn_totals(x)

A table with 3 variables

table_3vars <- table(linelist$province, linelist$gender, linelist$outcome)

ftable(table_3vars)
##              Death Recover
##                           
## Anhui     f      1       0
##           m      1       1
## Beijing   f      0       1
##           m      0       1
## Fujian    f      0       0
##           m      0       3
## Guangdong f      0       0
##           m      0       0
## Hebei     f      1       0
##           m      0       0
## Henan     f      0       0
##           m      1       3
## Hunan     f      1       0
##           m      0       1
## Jiangsu   f      2       3
##           m      2       7
## Jiangxi   f      0       2
##           m      1       1
## Shandong  f      0       0
##           m      0       2
## Shanghai  f      3       3
##           m     12      10
## Taiwan    f      0       0
##           m      0       0
## Zhejiang  f      1       3
##           m      5       5

Summary by group

count_data %>% 
  group_by(District) %>% 
  summarise(n_obs = n(), # number of observations
            range_date = max(Date, na.rm=T) - min(Date, na.rm=T)
            )

Correlations

T-tests

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Simple statistical test

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

quant/quant, quant/cat, cat/cat t-tests odds ratios, mantel-haensel, etc.

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Epidemic curves

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(rio,          # File import
               here,         # File locator
               tidyverse,    # data management + ggplot2 graphics
               aweek,        # working with dates
               lubridate,    # Manipulate dates    
               incidence,    # an option for epicurves of linelist data
               stringr,      # Search and manipulate character strings
               forcats,      # working with factors
               RColorBrewer) # Color palettes from colorbrewer2.org

Load data

Two example datasets are used in this section:

  • Linelist of individual cases from a simulated Ebola epidemic
  • Aggregated counts by hospital from the same simulated Ebola epidemic

The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.

For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.

# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

Review the two datasets and notice the differences

Case linelist

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Case counts aggregated by hospital

# display the linelist data as a table
DT::datatable(count_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Set parameters

You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). In this case, we set this date as 27 July 2013.

Now we can reference the object data_date into the code and have it reference that date.

## set the report date for the report
## note: can be set to Sys.Date() for the current date
data_date <- as.Date("2015-05-15")

Verify dates

Verify that each relevant date column is class Date and has an appropriate range of values. This for loop prints a histogram for each column.

# create character vector of column names 
DateCols <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))

# Produce histogram of each date column
for (Col in DateCols) {     # open loop. iterate for each name in vector DateCols
  hist(linelist[, Col],     # print histogram of the column in linelist dataframe
       breaks = 50,         # number of breaks for the histogram
       xlab = Col)          # x-axis label is the name of the column
  }                         # close the loop

incidence package

Below are tabs on making quick epicurves using the incidence package

CAUTION: Epicontacts expects data to be in a “linelist” format of one row per case (not aggregated). If your data is aggregated counts, look to the ggplot epicurves tab.

TIP: The documentation for plotting an incidence object can be accessed by entering ?plot.incidence in your R console.

https://cran.r-project.org/web/packages/incidence/vignettes/customize_plot.html#example-data-simulated-ebola-outbreak

Simple

The incidence package requires 2 steps to plot an epicurve:

  1. Create an incidence object (using the function incidence())
  2. Plot the incidence object

A simple example - an epicurve of daily cases:

# load package
library(incidence)

# create the incidence object using data by day
epi_day   <- incidence(linelist$date_onset, interval = "day")

# plot the incidence object
plot(epi_day)

#### Change aggregation of cases

The interval defines how the observations are grouped. Available options include all options in the package aweek, including but not limited to:

  • “Monday week”
  • “2 Monday weeks”
  • “Sunday week”
  • “MMWRweek” (starts on Sunday)
  • “Month”
  • “Quarter”
  • “Year”
# Weekly
epi_wk  <- incidence(linelist$date_onset, interval = "Monday week")

# MMWR week
epi_MMWRwk  <- incidence(linelist$date_onset, interval = "MMWRweek")

# Three weeks
epi_3wk  <- incidence(linelist$date_onset, interval = "3 weeks")

# Monthly
epi_month <- incidence(linelist$date_onset, interval = "month")

# Plot the incidence objects (with titles)
plot(epi_wk)+     ggtitle("Monday weeks")
plot(epi_MMWRwk)+ ggtitle("MMWR weeks")
plot(epi_3wk)+    ggtitle("Every 3 weeks")
plot(epi_month)+  ggtitle("Months")

The first date and last date of the plot can also be specified.

Modifications

The incidence package enables modifications in the following ways:

  • Arguments of plot()
  • scale_x_incidence()
  • ggplot() additions

Read details in the Help files by entering ?scale_x_incidence and ?plot.incidence in the R console. Online vignettes are listed in the resources tab.

Aesthetic themes and other ggplot() syntax can be added with the + symbol, because incidence is using ggplot() in the background.

Filtered

To plot the epicurve of a subset of data, filter the data first and then use the subset in the incidence() command. The example below uses data filtered to show only cases at Central Hospital.

# filter the dataset
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using female-only data
central_outbreak <- incidence(central_data$date_onset, interval = "week")

# plot
plot(central_outbreak, color = "orange", border = "black", ylab = "Weekly case incidence")

Show individual cases

To show boxes around each individual case, use the argument show_cases = TRUE in the plot() function. This option can be more reader-friendly if used in a smaller outbreak. The boxes can also be applied when dates are aggregated into weeks or other time periods.
The code below creates the weekly epicurve for a smaller outbreak (only cases from Central Hospital), with boxes around each case.

# create filtered dataset for Central Hospital
central_data  <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object (weekly)
central_outbreak <- incidence(central_data$date_onset, interval = "week")

# plot outbreak, showing each case
plot(central_outbreak,
     show_cases = T)                 # show boxes around individual cases

The same epicurve showing individual cases, but with many other aesthetic modifications:

# add plot() arguments and ggplot() commands
plot(central_outbreak,
     show_cases = T,                 # show boxes around each individual case
     color = "lightblue",            # color inside boxes
     border = "darkblue",            # border color of boxes
     alpha = 0.5,                    # transparency
     xlab = "Date of onset",         # x-axis label
     ylab = "Weekly reported cases")+ # y-axis label
  
  # ggplot() commands added to the plot
  scale_x_date(expand = c(0,0),
               date_breaks = "3 weeks",
               date_minor_breaks = "week",
               date_labels = "%d\n%b")+                  # adjust how dates are displayed
  scale_y_continuous(expand = c(0,0), 
                     breaks = seq(0, 35, 5))+           # adjust y-axis intervals
  theme_minimal()+                                      # simplify background
  theme(axis.title = element_text(size = 12, face = "bold"),
        plot.caption = element_text(face = "italic", hjust = 0))+
  labs(#x = "", 
       #y = "", 
       title = "Weekly case incidence at Central Hospital",
       #subtitle = "",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Adjusting axes

First, you must realize that axis labels are independent from the aggregation of the data into bars.

Modify the bars
The aggregation of data into bars occurs when you set the interval = when creating the incidence object:

central_outbreak <- incidence(central_data$date_onset, interval = "week")

The options for the interval come from the package aweek and include “day”, “Monday week”, “Sunday week”, “month” etc.

Modify date label frequency/format/gridlines

To do this, add command scale_x_incidence() with a +, as you would a ggplot() command.

Type ?scale_x_incidence into the R console to see more information.

n_breaks make_breaks() - auto evenly spaced breaks from object that always align with bins and start on first observed incidence. labels_week - YYYY-Www if weekly default T

  1. Use scale_x_date() if your x-axis is a date Class column.
  • Use the date_breaks = argument to designate frequency of labels (e.g. “1 day”, “week”, “3 weeks”, “month”, “year”) - weeks will start on Mondays while months and years start on their 1st day
  • Add the argument expand = c(0,0) to start the labels at the first bar. Otherwise, the first label will appear to shift depending on your specified frequency, as it will start from a point before to the first bar.
  • Use the date_minor_breaks = argument with same syntax as above to designate frequency of vertical lines
  • Use the date_labels = argument to provide a format for how the date labels should appear. See the Dates page for tips. Tip: use \n for a new line, %W for ISO week number, and %A for year.
  • If you desire Sunday weeks, you must instead use the breaks =, minor_breaks = arguments and manually specify a sequence of dates for the breaks. You can still use date_labels = for formatting.
  1. *Note: if using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require use scale_x_discrete() instead of scale_x_date() - see ggplot tips page for more details.

Note in the previous plot that the grey gridlines align exactly with the boxes around each case. This is because the gridlines (date_minor_breaks) are specified to appear weekly, and the default week begins on Mondays. The incidence object central_outbreak was also originally created using “Monday week” - so they align. Yet you can imagine how these components could be mis-aligned if attempting to use Sunday weeks, or gridlines at the 1st of each month, for example.

show the 1st of a month, whereas the case boxes are by week. Weeks may overlap the start of months. To align them, include labels_week = T in plot() and do not specify a scale_x_date().

TIP: While the data may be grouped by day or week, you can quickly reduce the number of labels appearing on the x-axis using scale_x_incidence() and specifying the number of axis breaks.

DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut-off your data if the data changes!.

To enforce a x-axis label for every bin (daily, weekly, etc.), set the n_breaks = argument in plot() to the number of rows in your incidence object: n_breaks = nrow(epi_wk)

scale_x_date(date_breaks = , date_labels = ) discuss. TO DO

Here is how to produce a weekly epicurve using incidence for Sunday weeks:

# create incidence object (specifying SUNDAY weeks)
central_outbreak <- incidence(central_data$date_onset, interval = "Sunday week") # equivalent to "MMWR week" (see US CDC)

# plot() the incidence object
plot(central_outbreak,
     show_cases = T,                 # show boxes around each individual case
     color = "lightblue",            # color inside boxes
     border = "darkblue",            # border color of boxes
     alpha = 0.5,                    # transparency
     xlab = "Date of onset",         # x-axis label
     ylab = "Weekly reported cases", # y-axis label
     n_breaks = 7)+                  # to override auto number of x-axis breaks
  
  # ggplot() commands added to the plot
  ggtitle("Weekly case incidence at Central Hospital")+ # add title separately
  scale_x_date(expand = c(0,0),
               breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                 to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                 by   = "3 weeks"),
               minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                       to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                       by   = "7 days"),
               date_labels = "%d\n%b\n%Y")+             # adjust how dates are displayed
  scale_y_continuous(expand = c(0,0),
                     breaks = seq(0, 30, 5))+           # adjust y-axis intervals
  theme_minimal()+                                      # simplify background
  theme(axis.title = element_text(size = 12, face = "bold"),
        plot.caption = element_text(face = "italic", hjust = 0))+
  labs(#x = "", 
       #y = "", 
       title = "Weekly case incidence at Central Hospital",
       #subtitle = "",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Color by group

To color the cases by group, provide the column to the groups = argument in the incidence() command. In the example below the cases are colored by their generation in the overall transmission chain.

Note the incidence() argument na_as_group = which if FALSE will prevent missing values (NA) from being categorized as their own group.

# Create object, with data grouped by generation
gen_outbreak <- incidence(linelist$date_onset, 
                               interval = "week", 
                               groups = linelist$generation,
                               na_as_group = FALSE)   # Missing values not assigned their own group


# plot the epicurve
plot(gen_outbreak) 

Here is an example with each case shown, using the smaller outbreak in the data from Central Hospital only (cases colored by gender)

# Create incidence object, data grouped by gender
#################################################

# Classify "gender" column as factor
####################################
# with specific level order and labels, includin for missing values
central_data <- linelist %>% 
  filter(hospital == "Central Hospital") %>% 
  mutate(gender = factor(gender, levels = c(NA, "f", "m"), labels = c("Missing", "Female", "Male"), exclude = NULL))

# Create incidence object, by gender
####################################
gender_outbreak_central <- incidence(central_data$date_onset, 
                                     interval = "week", 
                                     groups = central_data$gender,
                                     na_as_group = TRUE)   # Missing values assigned their own group


# plot standard epicurve
########################
plot(gender_outbreak_central,
     show_cases = T)                                      # incidence: show cases


# plot epicurve with modifications
##################################
plot(gender_outbreak_central,
     show_cases = T)+                                     # incidence: show cases
     
     # ggplot modifications
     # date scale / gridlines
     scale_x_date(expand = c(0,0),
                  date_breaks = "6 weeks",
                  date_minor_breaks = "week",
                  date_labels = "%d %b\n%Y")+
  
     # aesthetic themes
     theme_minimal()+                                              # simplify plot background
     theme(legend.position = "bottom",                             # placement of legend
           legend.title = element_text(size = 14, face = "bold"),  # legend title size, face
           axis.title = element_text(face = "bold"))+              # axis title bold
     
      # plot labels
      labs(fill = "Gender",                               # title of legend
           title = "Show case boxes, with modifications",
           y = "Weekly case incidence",
           x = "Week of symptom onset")      

Changing colors and legend

To change the legend, use ggplot() commands such as: theme(legend.position = "top"), theme(legend.direction = "horizontal"), and theme(legend.title = element_blank()). See the page of ggplot() tips for more details.

To change the color palette, use the argument col_pal in plot() to change the color palette to one of the default base R palettes (do not put the name of the palette in quotes).

Other palettes include TO DO add page with palette names… To DO

plot(gen_outbreak, col_pal = rainbow)

To specify colors manually, provide the name of the color or a character vector of colors to the argument color =. Note to function properly the number of colors must equal the number of groups (be aware of missing values as a group)

hosp_outbreak <- incidence(linelist$date_onset, 
                               interval = "week", 
                               groups = linelist$hospital,
                               na_as_group = FALSE)   # Missing values not assigned their own group
# default colors
plot(hosp_outbreak)

# manual colors
plot(hosp_outbreak, color = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange"))

Facets/small multiples

To facet the plot by a variable (make “small multiples”), see the tab on epicurves with ggplot()

ggplot2

Below are tabs on using “ggplot2” package to produce epicurves

Unlike using incidence package, you must manually control the aggregation of the data (into weeks, months, etc) and the labels on the date axis. If not carefully managed, this can lead to many headaches.

Simple epicurves from linelist

Assume you begin with a linelist dataset, as shown in the Preparation tab.

Below is perhaps the most simple code to produce daily epicurve with ggplot().

# Daily case counts 
###################
ggplot(data = central_data, aes(x = date_onset)) +  # x column must be class Date
  geom_histogram(binwidth = 1)+                  # date values binned by 1 day 
  labs(title = "Daily")

# weekly
ggplot(data = central_data, aes(x = date_onset)) +                 # x column must be class Date
  geom_histogram(binwidth = 7, color = "black", fill = "white")+    # date values binned by 1 day 
  labs(title = "Weekly")

Below is ggplot() code to produce weekly epicurves for Monday and Sunday weeks, starting from linelist-format data.
See the tab on Modifications (axes) to learn the nuances of date-axis management.

Monday weeks

The below code creates a histogram of the rows, by a date column as the x-axis. Of note:

  • The break points of the histogram’s bins are specified manually to begin the Monday before the earliest case (floor_date() with week_start = 1) and to end the Monday after the last case (ceiling_date()).
  • Date labels on the date axis (remember, independent from the bins) utilize the date_breaks, date_minor_breaks (vertical lines), and date_labels arguments of scale_x_date(). These arguments use weeks starting on Mondays (by default and difficult to change). Depending on the distance between date labels, the location of the last label may vary.
  • Adding expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.
# TOTAL MONDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) + 
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                      by   = "7 days"), # bins are 7-days
    color = "darkblue",   # color of lines around bars
    fill = "lightblue") + # color of fill within bars
  
  # x-axis labels
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+             # remove excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases (Monday weeks)",
       subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Mondays",
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Sunday weeks

The below code creates a histogram of the rows, using a date column as the x-axis. Of note:

  • The break points of the histogram’s bins are specified manually to begin the Sunday before the earliest case (floor_date() with week_start = 7) and to end the Sunday after the last case (ceiling_date()). Depending on the distance between date labels, the location of the last label may vary.
  • Date labels on the date axis (remember, independent from the bins) are also manually specified using similar syntax as the break points. You cannot use the scale_x_date() arguments of date_breaks and date_minor_breaks as these use Monday weeks.
  • Adding expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.
# TOTAL SUNDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) + 
  
  # For histogram, manually specify bin break points: starts the Sunday before first case, end Sunday after last case
  geom_histogram(                    
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "7 days"), # bins are 7-days
    color = "darkblue",   # color of lines around bars
    fill = "lightblue") + # color of fill within bars
  
  # The labels on the x-axis
  scale_x_date(expand = c(0,0),
               breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                 to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                 by   = "3 weeks"),
               minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                       to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                       by   = "7 days"),
               date_labels = "%d\n%b\n'%y")+             # day, above month abbrev., above 2-digit year
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases (Sunday weeks)",
       subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Sundays",
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Modifications

Modify axes

First, you must realize that axis labels are independent from the aggregation of the data into bins/bars.

To modify the bins/bars, you can:

  1. Specify a binwidth = within geom_histogram() - for a column of class Date, the given number is interpreted in days.
  2. Specify the sequence of bin break-points manually by providing a sequence of dates to breaks =. Note, for histogram bins this is within the geom_histogram(), not within scale_x_date() as discussed below for labels.
  3. Group the rows into aggregated counts (by week, month, etc.) and feed the aggregated counts to ggplot(). See the tab on aggregated counts for more information.

To modify the date labels, you can:

  1. Use scale_x_date() if your x-axis is a date Class column.
  • Use the date_breaks = argument to designate frequency of labels (e.g. “1 day”, “week”, “3 weeks”, “month”, “year”) - weeks will start on Mondays while months and years start on their 1st day
  • Add the argument expand = c(0,0) to start the labels at the first bar. Otherwise, the first label will appear to shift depending on your specified frequency, as it will start from a point before to the first bar.
  • Use the date_minor_breaks = argument with same syntax as above to designate frequency of vertical lines
  • Use the date_labels = argument to provide a format for how the date labels should appear. See the Dates page for tips. Tip: use \n for a new line, %W for ISO week number, and %A for year.
  • If you desire Sunday weeks, you must instead use the breaks =, minor_breaks = arguments and manually specify a sequence of dates for the breaks. You can still use date_labels = for formatting.
  1. *Note: if using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require use scale_x_discrete() instead of scale_x_date() - see ggplot tips page for more details.

Set maximum and minimum date values using limits = c() within scale_x_date(). E.g. scale_x_date(limits = c(as.Date("2014-04-01), NA)) sets a minimum but leaves the maximum open.

CAUTION: Caution using limits! They remove all data outside the limits, which can impact y-axis max/min, modeling, and other statistics. Strongly consider instead using limits by adding coord_cartesian() to your plot, which acts as a “zoom” without removing data.

https://rdrr.io/r/base/strptime.html —– see all % shortcuts

# Auto everything
#################
ggplot(central_data, aes(x = date_onset)) + # x column must be class Date
  geom_histogram() +            
  labs(title = "All defaults",
       subtitle = "! caution: number of days per bin not specified")


# 7-day binwidth and colors specified, axes flush with labels
#################
ggplot(central_data, aes(x = date_onset)) + # x column must be class Date
  geom_histogram(binwidth = 7,                       # 7 days per bin
                 color = "darkblue",                 # color of lines around bars
                 fill = "lightblue") +               # color of bar fill
  scale_x_date(expand = c(0,0))+                     # remove excess space earlier and later than case bars
  scale_y_continuous(expand = c(0,0))+               # remove excess space between axes and labels
  labs(title = "Specified 7-days per bin, colors added, axes excess space removed",
       subtitle = "! caution: 7-day bins begin at first case (Thurs 1 May)\ndefault date_breaks and minor_break vertical lines")


# specify date_breaks 3 WEEKS
#############################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(binwidth = 7,
                 color = "darkblue",
                 fill = "lightblue") +
  scale_x_date(expand = c(0,0),
                date_breaks = "3 weeks",       # note: Monday weeks by default
                date_minor_breaks = "week",
                date_labels = "%d\n%b\n'%y")+
  scale_y_continuous(expand = c(0,0))+
  labs(title = "date_labels = '3 weeks' (Mondays), labels formated, with weekly vertical lines",
       subtitle = "! caution: 7-day bins begin at first case (Thurs 1 May)...\n...while 3-weekly date_breaks and weekly minor_break lines show as Mondays (note mis-alignment of bar/axis tick")


# specify date_breaks 4 WEEKS
#############################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(binwidth = 7,
                 color = "darkblue",
                 fill = "lightblue") +
  scale_x_date(expand = c(0,0),
                date_breaks = "4 weeks",
                date_minor_breaks = "week",
                date_labels = "%d\n%b\n'%y")+
  scale_y_continuous(expand = c(0,0))+
  labs(title = "date_labels = '4 weeks'",
       subtitle = "! caution: 7-day bins begin at first case (Thursday 1 May)\nwhile 4-weekly date_breaks and weekly minor_break lines show as Mondays\nAlso note shift of first date label")


# specify date_breaks MONTHS (note each bin begins 1st of month)
################################################################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(binwidth = 7,
                 color = "darkblue",
                 fill = "lightblue") +
  scale_x_date(expand = c(0,0),
                date_breaks = "months",
                date_minor_breaks = "week",
                date_labels = "%d\n%b\n'%y")+
  scale_y_continuous(expand = c(0,0))+
  labs(title = "date labels = 'months'",
       subtitle = "! caution: 7-day bins begin at first case (Thursday 1 May)...\n...while date_breaks = 'months' shows as 1st of each month...\n...and weekly minor_break lines default to Mondays (note axis ticks don't align with bars and uneven line spacing)")


# TOTAL MONDAY ALIGNMENT: specify manual bin breaks to be mondays
#################################################################
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                                   to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                                   by   = "7 days"),
                 color = "darkblue",
                 fill = "lightblue") + 
  scale_x_date(expand = c(0,0),
               date_breaks = "3 weeks",
               date_minor_breaks = "week",
               date_labels = "%d\n%b\n'%y")+
  labs(title = "Total Monday alignment: 7-day bins manually set to begin the Monday before first case (28 Apr)",
       subtitle = "date_labels and minor_break lines are Monday weeks by default")


# TOTAL SUNDAY ALIGNMENT: specify manual bin breaks AND labels to be Sundays
############################################################################
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                   to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                   by   = "7 days"),
                 color = "darkblue",
                 fill = "lightblue") + 
  scale_x_date(expand = c(0,0),
               breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                   to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                   by   = "3 weeks"),
               minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                   to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                   by   = "7 days"),
               date_labels = "%d\n%b\n'%y")+
  labs(title = "Total Sunday alignment: 7-day bins manually set to begin the Sunday before first case (27 Apr)",
       subtitle = "Lines and date labels manually set to Sunday weeks")



# Check values of bars by creating dataframe of grouped values
# central_tab <- central_data %>% 
#   mutate(week = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>% 
#   group_by(week, .drop=F) %>%
#   summarize(n = n()) %>% 
#   mutate(groups_3wk = 1:(nrow(central_tab)+1) %/% 3) %>% 
#   group_by(groups_3wk) %>% 
#   summarize(n = n())

Color by groups

Designate a column containing groups

In any of the code template (Sunday weeks, Monday weeks), make the following changes:

  • Add the aesthetics argument aes() within the geom_histogram() (don’t forget comma afterward)
  • Within aes(), provide the group = and fill = arguments with column names (no quotes needed). The values in these respective columns will be the grouping for the stacked bars and their colors.
  • Remove any fill = argument outside of the aes(), as it will override the one inside
geom_histogram(
    aes(group = gender, fill = gender))

Adjust colors:

  • To manually adjust the colors of each group value, use scale_fill_manual() (note scale_color_manual() is different!).
    • Use the values = argument to apply a vector of colors.
    • Use na.value = to specify a color for missing values.
    • ! While you can use the labels = argument in scale_fill_manual() change the legend text labels - it is easy to accidentally give labels in the incorrect order and have an incorrect legend! It is recommended to instead convert the group column to class Factor and designate factor labels and order, as explained below.
  • To adjust the colors via a color scale, see the page on ggplot tips

Adjust the stacking order and Legend

Stacking order, and the labels for each group in the legend, is best adjusted by classifying the group column as class Factor. You can then designate the levels and their labels, and the order (which is reflected in stack order).

Step 1: Convert the group column to class Factor using factor() from base R.
For example, with a column “gender” with values “m” and “f” and NA, this can be put in a mutate() command as:

dataset <- dataset %>% 
  mutate(gender = factor(gender,
                    levels = c(NA, "f", "m"),
                    labels = c("Missing", "Female", "Male"),
                    exclude = NULL))

The above code establishes the levels, in the ordering that missing values are “first” (and will appear on top). Then the labels that will show are given in the same order. Lastly, the exclude statement ensure that NA is included in the ordering (by default factor() ignores NA from the leveling).

Read more about factors in their dedicated handbook page (LINK).

Adjusting the legend

Read more about legends in the ggplot tips page. Here are a few highlights:

  • Adjust the value labels as described above, using Factor labels and order
  • To adjust the order of legend labels independently of the stacking order, use
  • Adjust the legend Title using fill = within labs(). To remove the title completely do not specify fill and instead add legend.title = element_blank() within theme().

These steps are shown in the example below:

########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                      by   = "7 days") # bins are 7-days

# Set gender as factor and missing values as first level (to show on top)
central_data <- central_data %>% 
  mutate(gender = factor(gender,
                         levels = c(NA, "f", "m"),
                         labels = c("Missing", "Female", "Male"),
                         exclude = NULL))  

# make plot
###########
ggplot(central_data, aes(x = date_onset)) + 
    geom_histogram(
        aes(group = gender, fill = gender),    # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks)+                  # see breaks defined above
                      
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(values = c("grey", "orange", "purple"))+ # specify fill colors ("values") - attention to order!

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases, by gender",
       subtitle = "Subtitle",
       fill     = "Gender",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Display bars side-by-side

Side-by-side display of group bars (as opposed to stacked) is specified within geom_histogram() with position = "dodge".
If there are more than two value groups, these can become difficult to read. Consider instead using a faceted plot (small multiples) (see tab). To improve readability in this example, missing gender values are removed.

########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                      by   = "7 days") # bins are 7-days

# New dataset without rows missing gender
central_data_dodge <- central_data %>%
  filter(!is.na(gender)) %>%                            # remove rows missing gender
  mutate(gender = factor(gender,                        # factor now has only two levels (missing not included)
                         levels = c("f", "m"),
                         labels = c("Female", "Male")))  

# make plot
###########
ggplot(central_data_dodge, aes(x = date_onset)) + 
    geom_histogram(
        aes(group = gender, fill = gender),    # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks,
        position = "dodge")+                  # see breaks defined above
                      
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(values = c("pink", "lightblue"))+     # specify fill colors ("values") - attention to order!

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases, by gender",
       subtitle = "Subtitle",
       fill     = "Gender",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Faceting/small-multiples

########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                      by   = "7 days") # bins are 7-days

# Set gender as factor and missing values as first level (to show on top)
central_data <- linelist %>% 
  filter(hospital == "Central Hospital") %>%  
  mutate(gender = factor(gender,
                         levels = c(NA, "f", "m"),
                         labels = c("Missing", "Female", "Male"),
                         exclude = NULL))  

# make plot
###########
ggplot(central_data, aes(x = date_onset)) + 
    geom_histogram(
        aes(group = gender, fill = gender),    # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks)+                  # see breaks defined above
                      
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(values = c("grey", "orange", "purple"))+ # specify fill colors ("values") - attention to order!

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases, by gender",
       subtitle = "Subtitle",
       fill     = "Gender",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Moving average

Add rectangle for tentative reporting

Dual axis

transform and display as aggregate counts

Best is to use the aweek package to create a new column, then group_by and summarize,

Get weeks, then plot use aweek to convert dates to weeks, then plot the weekly counts
To aggregate into weeks and show ALL weeks (even ones with no cases), do these two things:
1) use floor_date = T and factor = T in the date2week() command, and
2) use .drop = F in the group_by() command

like this:

central_tab <- central_data %>% 
  mutate(week = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>% 
  group_by(week, .drop=F) %>%
  summarize(n = n())
linelist <- linelist %>% 
  mutate(week = aweek::date2week(date_onset, floor_day = T))

linelist %>% 
  group_by(week, .drop = F) %>% 
  summarise(n = n())
## # A tibble: 57 x 2
##    week         n
##    <aweek>  <int>
##  1 2014-W15     1
##  2 2014-W16     1
##  3 2014-W17     5
##  4 2014-W18     4
##  5 2014-W19    12
##  6 2014-W20    17
##  7 2014-W21    15
##  8 2014-W22    21
##  9 2014-W23    24
## 10 2014-W24    21
## # ... with 47 more rows
# Preparation
#############
# Create epiweek variable. Factor argument automatically includes all weeks in span. Numeric shows just the week number.
linelist$epiweek <- aweek::date2week(linelist$date_onset, factor = TRUE, numeric = TRUE)

# Calculate maximum number of cases in an epiweek, to get the maximum y-axis height (also helps with uniformity in multiple plots)
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))

# Weekly case counts 
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) + 
  
  # stacked bars, bined by week (7 days)
  stat_bin(binwidth = 7, position = "stack", fill = "grey", color = "black") +
  
  # X-axis 21-day labels
  scale_x_date( 
    # Sets date label breaks as every 3 weeks from Monday before the first case
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm=T), by = "3 weeks"),
    
    # axis limits determined by max/min + buffer
    limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), 
    
    # displays as date number, then abbreviated month (e.g. 12 Oct)
    date_labels = "%d-%b",   
    
    # sets origin at (0,0)
    expand = c(0,0)) +                                
  
  # Y-axis breaks every 5 cases
  scale_y_continuous(breaks = seq(0, ymax, 25),
                     limits = c(0, ymax),
                     expand = c(0, 0)) +
  
  # Theme specifications (axis, text, etc.)
  theme(# title
        plot.title = element_text(size=20, hjust= 0, face="bold"),   # title size, font, bold
        # axes
        axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
        axis.text = element_text(size=12),
        axis.title = element_text(size=14, face="bold"),
        axis.line = element_line(colour="black"),
        # background
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        # caption (italics, on right side)
        plot.caption = element_text(hjust = 0, face = "italic")
        ) +
  
  guides(fill = guide_legend(reverse = TRUE,                   # Orders Non-active zones at end of legend
                             override.aes = list(size = 0.2),
                             ncol = 2)) +                        # Number of legend columns
  
  labs(x = "Week of illness onset", 
       y = "Number of cases",
       subtitle = "subtitle here")
  
  ggtitle("Epidemic curve")
## $title
## [1] "Epidemic curve"
## 
## attr(,"class")
## [1] "labels"


# print
print(plot_weekly)

By category

Colored by a category

# Setup
########
# Two known classes (select colors from colorbrewer2.org)
colors_overall = c("#d95f02",  # 
                   "#1b9e77",
                   "#7570b3")  #    

# Order sex variable by reverse # of cases, so plot stacks with smallest # of cases at top
linelist$gender <- factor(linelist$gender, 
                         levels = levels(fct_rev(fct_infreq(linelist$gender))))

# Calculates maximum yaxis height for uniformity between the two graphs
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(linelist[is.na(linelist$date_onset),])



# PLOT - BY ONSET DATE
######################
plot_defined_cats <- ggplot(linelist, aes(x = date_onset, fill = gender)) + 
  
  # stacked bars, width of 7 days
  stat_bin(binwidth = 7, position = "stack") +
  
  # Colors and labels of confirmed/probable
  scale_fill_manual(values = rev(colors_overall),
                    labels = str_to_sentence(levels(factor(linelist$gender)))) +
  
  # X-axis scale labels (not aggregation, just the labels)
  scale_x_date(# Sets date label breaks as every week
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "3 weeks"),
    limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date # then abbreviated month (e.g. 12 Oct)
    expand = c(0, 0)) +                                # sets origin at (0,0)
  
  # Y-scale in breaks, up to the ymax previously defined
  scale_y_continuous(breaks = seq(0, 500, 25), limits = c(0, ymax), expand=c(0, 0)) +
  
  # Themes for axes, titles, background, etc.
  theme(plot.title       = element_text(size=20, hjust=0.5, face="bold"),
        axis.text        = element_text(size=12),
        axis.title       = element_text(size=14, face="bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line        = element_line(colour = "black"),
        axis.text.x      = element_text(angle=90, vjust=0.5, hjust=1)) +
  
  # Legend specifications
  theme(legend.title           = element_blank(),
        legend.justification   = c(0, 1), 
        legend.position        = c(0.09, 0.98),
        legend.background      = element_blank(),
        legend.text            = element_text(size = 12)) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
  
  # Axis and caption labels
  labs(x = "Week of illness onset",
       y = "Number of Cases") +
  
  # Title
  ggtitle("Cases by week of illness onset")

# print
print(plot_defined_cats)

By active area

# PARAMETERS
#############

# Maximum y-value for epiweek (this will be larger than necessary because of missing onset dates)
ymax <- max(table(linelist$epiweek))

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))


# SETUP - ACTIVE/NON-ACTIVE ZONES
#################################
# List of "active" zones with a case in the date range
active_zones <- unique(linelist$province[which(linelist$date_onset > (data_date - 90))])
active_zones

# Table of active zones and their overall number of cases (for ordering their stacked appearance)
order_table <- linelist %>%
  filter(province %in% active_zones) %>%
  group_by(province) %>%
  summarise(cases = n())
order_table

# Create TRUE/FALSE variable for "active" health zones
linelist$active_zone <- ifelse(linelist$province %in% active_zones, TRUE, FALSE)

# Create list of non-active HZ names for bottom of plot
other_zone_names <- unique(sort(linelist$province[linelist$active_zone == FALSE]))

# Make variable for graph categories, including a level for "non-active" zones
linelist$graph_zone <- factor(case_when(
  
  # Value assignments
  # Non-active zones
  linelist$active_zone == FALSE    ~ "Non-active zones",
  # All others are assigned their names, capitalized
  TRUE  ~ stringr::str_to_title(linelist$province)),
  
  # Order of variable levels
  levels = c(
    # "Non-active zones" is first level
    "Non-active zones",  
    
    # Orders active zones by their frequency in linelist, reversed, so most-affected zones are on the BOTTOM of plot
    str_to_title(rev(levels(fct_infreq(as.factor(linelist$province[linelist$active_zone == TRUE])))))))  


table(linelist$graph_zone, useNA = "ifany")


# COLORS
########
# Number of unique values in graph_zone variable, minus 1 (for non-active, which is added later as grey (#cccccc))
colors_needed <- length(unique(linelist$graph_zone, na.rm=T)) - 1 

# List of possible colors (see colorbrewer2.com, qualitative scheme)
colors_linelist = c(#"#cccccc", # first = non-active grey color
                "#1b9e77", # turquoise green
                "#ff7f00", # orange
                "#ffff33", # yellow
                "#6a3d9a", # purple
                "#b15928", # brown
                "#1f78b4", # blue
                "#e31a1c", # red,
                "#fb9a99", # pink
                "#b2df8a", # light green 
                "#cab2d6", # light purple
                "#a6cee3", # light blue
                "#fdbf6f", # beige
                "#33a02c"  # green
)

# Reduce number of colors to only the number needed
colors_linelist <- c("#cccccc", rev(colors_linelist[1:colors_needed]))


# MAKE GRAPH
#############
plot_overall <- ggplot(linelist, aes(x = date_onset, fill = graph_zone)) + 
  
  # stacked bars, bined by week (7 days)
  stat_bin(binwidth = 7, position = "stack") +
  
  # Fill of bars
  scale_fill_manual(values = colors_linelist, 
                    labels = str_to_sentence(levels(factor(linelist$graph_zone)))) +
  
  # X-axis 21-day labels
  scale_x_date( # Sets date label breaks as every 3 weeks from Monday before the first case
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "3 weeks"),
    limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date number, then abbreviated month (e.g. 12 Oct)
    expand = c(0,0)) +                                # sets origin at (0,0)
  
  # Y-axis breaks every 5 cases
  scale_y_continuous(breaks = seq(0, ymax, 25),
                     limits = c(0, ymax),
                     expand = c(0, 0)) +
  
  # Theme specifications (axis, text, etc.)
  theme(plot.title = element_text(size = 20, hjust = 0, face = "bold"),   # title size, font, bold
        axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
        axis.text = element_text(size=12),
        axis.title = element_text(size=14, face="bold"),
        axis.line = element_line(colour="black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        plot.caption = element_text(hjust = 0, face = "italic")
        ) +
  
  # Legend specifications
  theme(legend.title = element_blank(),                        # No legend title
        legend.position = c(0.20, 0.85),                         # placement of legend
        legend.background = element_blank(),                     # legend background   
        legend.text = element_text(size=12)) +                 # legend text size
  
  guides(fill = guide_legend(reverse = TRUE,                   # Orders Non-active zones at end of legend
                             override.aes = list(size = 0.2),
                             ncol = 2)) +                        # Number of legend columns
  
  labs(x = "Week of illness onset", 
       y = "Number of cases",
       subtitle = "Health zones with cases in the last 42 days specified by color",
       caption = paste0(nrow(linelist),
                        " confirmed and probable cases, reported as of ", data_date, ". ",
                        missing_onset, " cases missing date of onset and not shown.",
                        "\nNon-active zones include: ", str_to_title(toString(unique(linelist$province[linelist$active_zone == FALSE]))))) +
  
  ggtitle("Epidemic curve by active health zones")


# print
plot_overall

By grouped areas


#SETUP
#############
# Filter to health zone of interest
zone_data <- linelist

# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))

# Assign health area groups (individual for HAs of interest, groups others together)
linelist$graph_areas <- factor(case_when(
  linelist$province == "Shanghai"    ~ "Shanghai",
  linelist$province == "Jiangsu"     ~ "Jiangsu",
  linelist$province == "Zhejiang"    ~ "Zhejiang",
  TRUE                                ~ "Other (10)"
),
# Levels part of the factor function assigns order of appearance
levels = c(                         
  "Other (10)",
  "Shanghai",
  "Jiangsu",
  "Zhejiang"
)
)

# checks
table(linelist$graph_areas, useNA = "ifany")

# Color assignments
colors_needed <- length(unique(linelist$graph_areas, na.rm=T)) - 1 # number of colors needed

# list of colors
colors_aire = c("#a6cee3", 
                           "#1f78b4",
                           "#b2df8a", 
                           "#33a02c",
                           "#fb9a99",
                           "#e31a1c",
                           "#fdbf6f",
                           "#ff7f00",
                           "#cab2d6",
                           "#6a3d9a",
                           "#ffff99",
                           "#b15928"
                           )

# Reduce number of colors to only the number needed
colors_aire <- c("#cccccc", rev(colors_aire[1:colors_needed]))


# Plot of province
#####################################
plot <- ggplot(linelist, aes(x = date_onset, fill = graph_areas)) + 
  
  stat_bin(binwidth = 7, position="stack") +
  
  scale_fill_manual(values = colors_aire, labels = str_to_sentence(levels(factor(linelist$graph_areas)))) +
  
  scale_x_date(date_breaks = "1 week", date_labels = "%d-%b", limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), expand=c(0,0)) + # I used the date onset variable here so x axes will be the same
  
  scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
  
  theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
        plot.caption = element_text(hjust = 0, face = "italic"),
        
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        
        
        axis.line = element_line(colour = "black"),
        axis.text = element_text(size = 12),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        axis.title = element_text(size = 14, face = "bold"),
        
        legend.title = element_blank(),
        legend.justification = c(0,1), 
        legend.position = c(0.05, 1),
        legend.background = element_blank(),
        legend.text = element_text(size = 12)) +
  
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 4)) +
  
  labs(x="Week of illness onset", 
       y="Number of cases",
       subtitle = "",
       caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset.")) +
  
  ggtitle("Cases of influenza, by province")

plot

Faceting

By “wave” in time

# Define the waves
##################
# zone_data <- filter(linelist, zone_de_sante == "mabalako")
# 
# zone_data$wave <- case_when(
#   zone_data$date_onset >= as.Date("2018-03-01") &
#     zone_data$date_onset < as.Date("2018-10-25")     ~ "Wave 1",
#   
#   zone_data$date_onset >= as.Date("2018-10-25") &
#     zone_data$date_onset < as.Date("2019-02-01")     ~ "Wave 2",
#   
#   zone_data$date_onset >= as.Date("2019-02-01") &
#     zone_data$date_onset < as.Date("2019-09-15")     ~ "Wave 3",
#   
#   zone_data$date_onset >= as.Date("2019-09-15")      ~ "Wave 4",
#   
#   TRUE ~ NA_character_
# )
# 
# table(is.na(zone_data$date_onset))
# table(zone_data$wave, useNA = "always")
# 
# 
# # Color assignments
# colors_needed <- length(unique(zone_data$wave, na.rm=T)) # number of colors needed
# 
# # list of colors
# colors_aire = c("#a6cee3", 
#                            "#1f78b4",
#                            "#b2df8a", 
#                            "#33a02c",
#                            "#fb9a99",
#                            "#e31a1c",
#                            "#fdbf6f",
#                            "#ff7f00",
#                            "#cab2d6",
#                            "#6a3d9a",
#                            "#ffff99",
#                            "#b15928"
#                            )
# 
# # Reduce number of colors to only the number needed
# colors_aire <- c(rev(colors_aire[1:colors_needed]))
# 
# 
# # Plot of health zone colored by wave
# #####################################
# plot_Mabalako <- ggplot(zone_data, aes(x = date_onset, fill = wave)) + 
#   
#   stat_bin(binwidth = 7, position = "stack") +
#   
#   scale_fill_manual(values = rev(colors_aire), labels = str_to_sentence(levels(factor(zone_data$wave)))) +
#   
#   scale_x_date(date_breaks = "21 days", date_labels = "%d-%b",
#                limits = c((min(zone_data$date_onset, na.rm = T) - 8), (max(zone_data$date_report, na.rm = T) + 8)), expand = c(0,0)) +
#   
#   scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
#   
#   theme(text = element_text(family = "Segoe Condensed"),
#         axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
#         axis.text = element_text(size = 12),
#         axis.title = element_text(size = 14, face = "bold"),
#         axis.line = element_line(colour = "black"),
#         
#         plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
#         plot.caption = element_text(hjust = 0, face = "italic"),
#         
#         panel.grid.major = element_blank(),
#         panel.grid.minor = element_blank(),
#         panel.background = element_blank(),
#         
#         legend.title = element_blank(),
#         legend.justification = c(0,1), 
#         legend.position = c(0.75, 0.98),
#         legend.background = element_blank(),
#         legend.text = element_text(size=12)) +
#   
#   guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 1)) +
#   
#   labs(x="Week of illness onset", 
#        y="Number of cases",
#        subtitle = "",
#        caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset and 16 excluded due to uncertain health zone of report.")) +
#   
#   ggtitle("Four waves of EVD in Mabalako health zone")
# 
# plot_Mabalako
# 
# 
# # Produce table describing each wave
# ####################################
# table <- zone_data %>%
#   select("aire_de_sante", "wave", "community_death", "date_onset", "cte_date", "epicasedef", "community_death", "contact_registered", "contact_surveilled") %>%
#   group_by(wave) %>%
#   summarise(first_onset       = min(date_onset, na.rm = T),
#             last_admission    = max(cte_date, na.rm = T),
#             n                 = n(),
#             confirmed         = sum(epicasedef == "confirmed"),
#             community_deaths  = paste0(sum(community_death    == 1), 
#                                        " (", round(100*sum(community_death == 1)/confirmed),"%)"),
#             reg_contacts      = paste0(sum(contact_registered == "yes"),
#                                        " (", round(100*sum(contact_registered == "yes")/confirmed),"%)"),
#             surv_contacts     = paste0(sum(contact_surveilled == "yes"),
#                                        " (", round(100*sum(contact_surveilled == "yes")/confirmed),"%)"),
#             top               = paste(toupper(names(sort(table(aire_de_sante),decreasing=TRUE)[1:3])), collapse=", ",
#                                       round(100*(sort(table(aire_de_sante),decreasing=TRUE)[1:3]/confirmed)), "%"),
#             health_areas      = paste(toupper(unique(aire_de_sante)), collapse=', ') 
#             )
# 
# kable(table)

Aggregated data

Situation

Often you do not have linelist data, but instead daily case counts from facilities, districts, etc. You can plot these in an epidemiological curve, but the code will be slightly different.

This section will utilize the counts_data dataset that was imported earlier, in the data preparation section.

Note: The incidence package does not support aggregate data

Clean dates

As before, we must ensure date variables are correctly classified.

# Convert Date variable to Date class
class(count_data$date_hospitalisation)
## [1] "Date"
count_data$date_hospitalisation <- as.Date(count_data$date_hospitalisation)

Create weeks

# Create epiweek variable
# aweek weeks are also stored as dates, facilitating better display manipulation
count_data$epiweek <- aweek::date2week(count_data$date_hospitalisation,      # use the Date variable
                                        week_start = "Monday", # epiweek begins on Monday
                                        floor_day = TRUE,      # only display year and week #
                                        factor = TRUE)         # expand to include all possible weeks

Clean dates

ggplot(data = count_data, aes(x = as.Date(epiweek), y = n_cases, group = hospital, fill = hospital))+
     geom_bar(stat = "identity")+
     
     # LABELS for x-axis
     scale_x_date(date_breaks = "1 month",  # displays by month
                  date_labels = '%b%d\n%Y')+  #labeled by month with year below
     
     # Choose color palette (uses RColorBrewer package)
     scale_fill_brewer(palette = "Pastel1")+      
     
     # Theme specifications (axis, text, etc.)
     theme(
          # title
          plot.title = element_text(size=20, hjust= 0, face="bold"),   # title size, font, bold
          # axes
          axis.text.x = element_text(angle=0, vjust=0.5, hjust=1),
          axis.text = element_text(size=12),
          axis.title = element_text(size=14, face="bold"),
          axis.line = element_line(colour="black"),
          # background
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          # caption (italics, on right side)
          plot.caption = element_text(hjust = 0, face = "italic"))+
     
     # labels
     labs(x = "Week of report", 
          y = "Number of cases",
          subtitle = "Cases aggregated by week and shown by hospital",
          caption = "Data source: XXXXX")+
     
     ggtitle("Epidemic curve of disease X in fictional location")

Dual-axis

Although there are fierce discussions about the validity of this within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid with a second axis.

In ggplot it is very difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.

Second axis

This uses the linelist dataset

TODO not complete yet

library(reshape2)
# group the data by week, summarize counts by group (gender)
linelist_week <- linelist %>%
     mutate(onset_epiweek = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
     group_by(onset_epiweek) %>%
     summarize(num_male = sum(gender == "m"),
               num_female = sum(gender == "f"),
               pct_male = round(100*(num_male / n())),
               med_age  = median(as.numeric(age), na.rm=T)
               ) 
# remove pct and melt
linelist_week_melted <- linelist_week %>%
     select(-c("pct_male", "med_age")) %>%
     melt(id.vars = c("onset_epiweek"))

# merge together (multiple of the same values in week will attach to melted)
linelist_week_melted <- merge(linelist_week_melted,
                              linelist_week,
                              by = "onset_epiweek")

second_axis <- ggplot(linelist_week_melted,
                      aes(x = as.Date(onset_epiweek),
                          y = value, group = variable,
                          fill = variable)) + 
  
  # bars
  geom_bar(stat = "identity")+
  
  # Colors and labels of confirmed/probable
  scale_fill_manual(values = c("blue", "red"),
                    labels = str_to_sentence(levels(factor(linelist_week_melted$variable)))) +
  
  geom_line(mapping = aes(y = pct_male, color = "% male"), size = 0.5) +
     
  scale_color_manual(values = "black")+
     
  scale_y_continuous(sec.axis = sec_axis(~(./sum(linelist_week_melted$value, na.rm = T)*100), name = "name here", breaks = seq(0, 100, 20)))+


  # X-axis scale labels (not aggregation, just the labels)
  scale_x_date(# Sets date label breaks as every week
    breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
    limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
    date_labels = "%d-%b",   # displays as date # then abbreviated month (e.g. 12 Oct)
    expand = c(0, 0)) +                                # sets origin at (0,0)
  
  # Y-scale in breaks, up to the ymax previously defined
  scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
  
  # Themes for axes, titles, background, etc.
  theme(plot.title       = element_text(size=20, hjust=0.5, face="bold"),
        axis.text        = element_text(size=12),
        axis.title       = element_text(size=14, face="bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line        = element_line(colour = "black"),
        axis.text.x      = element_text(angle=90, vjust=0.5, hjust=1)) +
  
  # Legend specifications
  theme(legend.title           = element_blank(),
        legend.justification   = c(0, 1), 
        legend.position        = c(0.09, 0.98),
        legend.background      = element_blank(),
        legend.text            = element_text(size = 12)) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
  
  # Axis and caption labels
  labs(x = "Week of illness onset",
       y = "Number of Cases",
       caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
  
  # Title
  ggtitle("Cases by week of illness onset")

second_axis

# print
print(plot_defined_cats)

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Scatterplots

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

base R

plot(linelist$age)

ggplot2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Boxplots

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Boxplots can be created with:

  • the boxplot() function from the graphics package (installed automatically with base R), or
  • the ggplot() function from the ggplot2 package

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.
linelist <- linelist %>% 
  mutate(age = as.numeric(age))

boxplot()

Some options with boxplot() shown below are:

  • boxplots by group (color specification optional)
  • violin plots
  • boxplot width proportional to sample size
  • Horizontal
# boxplot of one numeric variable
boxplot(linelist$age,                 # numeric variable
        main="boxplot",               # main title
        xlab="Suppliment and Dose")   # x-axis label


# by group (formula style)
boxplot(age ~ gender, data=linelist, notch=TRUE, main="boxplot", xlab="Suppliment and Dose")

You can have multiple levels of group (e.g. age by outcome AND gender)
Notched “violin plots” are possible. The notch represents the median and X around it (TODO)

# By subgroup (age by outcome AND gender)
boxplot(age ~ outcome * gender,
        data=linelist,
        col=c("gold","darkgreen"),            # colors, in a vector
        main="Boxplot by Outcome and Gender", # main title
        xlab="Suppliment and Dose")           # x-axis label

# Notched (violin plot), and varying width
boxplot(age ~ outcome * gender,
        data=linelist,
        notch=TRUE,      # notch at median
        varwidth = TRUE, # width varying by sample size
        col=(c("gold","darkgreen")),
        main="Notched boxplot, width varying by sample size",
        xlab="Suppliment and Dose")

# Horizontal
boxplot(age~outcome,
        data=linelist,
        horizontal=TRUE,  # flip to horizontal
        col=(c("gold","darkgreen")),
        main="Horizontal boxplot",
        xlab="Suppliment and Dose")

ggplot()

Some options with ggplot() shown below are:

  • boxplots by group (color specification optional)
  • violin plots
  • boxplot width proportional to sample size
  • Horizontal
# Simple boxplot of one numeric variable
ggplot(data = linelist, aes(y = age))+  # only y variable given (no x variable)
  geom_boxplot()+
  ggtitle("Simple ggplot() boxplot")

# By group
ggplot(data = linelist, aes(y = age,         # numeric variable
                            x = outcome,      # group variable
                            fill = outcome))+ # fill variable (color of boxes)
  geom_boxplot()+                            # create the boxplot
  ggtitle("ggplot() boxplot by gender")      # main title

# Removing missing values, and add color
ggplot(data = linelist %>% filter(!is.na(outcome)), # dataset piped through a filter to retain rows where gender is not missing
       aes(y = age, x = outcome, fill= outcome))+    # boxes filled according to gender value
  geom_boxplot()+
  ggtitle("ggplot() boxplot by gender (missing excluded)")

To examine by subgroups, use facet_wrap() (for more see section on ggplot tips).

# By subgroup
ggplot(data = linelist %>% filter(!is.na(gender)), # dataset piped through a filter to retain rows where gender is not missing
       aes(y = age, x = outcome, fill=outcome))+
  geom_boxplot()+
  ggtitle("A ggplot() boxplot")+
  facet_wrap(~gender)

“Violin plots” can be made simply or very complex:

# Violin plots
ggplot(linelist, aes(x=age, y=outcome, fill = outcome)) + 
  geom_violin(trim=FALSE)


# Vertical violin plot
ggplot(linelist, aes(x=age, y=outcome, fill = outcome)) + 
  geom_violin(trim=FALSE)+
  coord_flip()


# Add jittered points
ggplot(linelist, aes(x=age, y=outcome, fill = outcome)) + 
  geom_violin(trim=FALSE)+
  coord_flip()+ 
  geom_jitter(shape=16,                      # points
              position=position_jitter(0.2)) # jitter permissible to avoid point overlap

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Moving averages

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Calculating Visualizing

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Transmission Chains

Overview

The primary tool to visualize and analyze transmission chains is the package epicontacts, developed by the folks at RECON.

Preparation

Visualization


links <- epicontacts::make_epicontacts(linelist = mers_korea_2015$linelist,
                                       contacts = mers_korea_2015$contacts, 
                                       directed = TRUE)
# plot without time
plot(links,
     selector = FALSE,
     height = 700,
     width = 700)

And in a transmission tree, with date of onset on the x-axis:

Note: this currently requires installing a development version of epicontacts from github… @ttree


# plot with date of onset as x-axis
plot(sim,
     x_axis = 'onset',
     height = 700,
     width = 700,
)

Analysis

summary(links)
## 
## /// Overview //
##   // number of unique IDs in linelist: 162
##   // number of unique IDs in contacts: 97
##   // number of unique IDs in both: 97
##   // number of contacts: 98
##   // contacts with both cases in linelist: 100 %
## 
## /// Degrees of the network //
##   // in-degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.6049  1.0000  3.0000 
## 
##   // out-degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.6049  0.0000 38.0000 
## 
##   // in and out degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    1.21    1.00   39.00 
## 
## /// Attributes //
##   // attributes in linelist:
##  age age_class sex place_infect reporting_ctry loc_hosp dt_onset dt_report week_report dt_start_exp dt_end_exp dt_diag outcome dt_death
## 
##   // attributes in contacts:
##  exposure diff_dt_onset

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Abberation detection

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Endemic corridor analysis Detecting spikes in syndromic/routine surveillance

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Time series analysis

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Survey analysis

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

From data frame

Overview

Weighting

Random selection

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Age Standardization

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Why How When etc.

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

PHEindicatormethods package

DSR package

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Diagrams

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Patient flow

E.g. EVD patient “pathways” to outcome (via clinic or not, etc.)

HIV care continuum datasets? PreP datasets?

Alluvial

Sankey plots - show transitions among cohort over time, interrelatedness of groups Liza Coyer TODO

Clinical trials

Or papers in meta-analysis

Event timelines

E.g. border closures during COVID

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Age Pyramids

Age pyramids can be useful to show patterns by age group. They can show gender, or the distribution of other characteristics.
These tabs demonstrate how to produce age pyramids using:

  • Fast & easy: Using the apyramid package
  • More flexible: Using ggplot()
  • Having baseline demographics displayed in the background of the pyramid
  • Using pyramid-style plots to show other types of data (e.g responses to Likert-style questions)

Overview

Age/gender demographic pyramids in R are generally made with ggplot() by creating two barplots (one for each gender), converting one’s values to negative values, and flipping the x and y axes to display the barplots vertically.

Here we offer a quick approach through the apyramid package:

  • More customizable code using the raw ggplot() commands
  • How to combine case demographic data and compare with that of a baseline population (as shown above)
  • Application of these methods to show other types of data (e.g. responses to Likert-style survey questions)

Preparation

For this tab we use the linelist dataset that is cleaned in the Cleaning tab.

To make a traditional age/sex demographic pyramid, the data must first be cleaned in the following ways:

  • The gender column must be cleaned.
  • Age should be in an age category column, and should be an of class Factor (with correctly ordered levels)

Load packages

First, load the packages required for this analysis:

pacman::p_load(rio,       # to import data
               here,      # to locate files
               tidyverse, # to clean, handle, and plot the data (includes ggplot2 package)
               apyramid,  # a package dedicated to creating age pyramids
               stringr)   # working with strings for titles, captions, etc.

Load the data

linelist <- rio::import("linelist_cleaned.csv")

Check class of variables

Ensure that the age variable is class Numeric, and check the class and order of levels of age_cat and age_cat5

class(linelist$age_years)
## [1] "numeric"

class(linelist$age_cat)
## [1] "factor"
class(linelist$age_cat5)
## [1] "factor"

table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##  1172  1130   995   895  1090   605    26     0    92
table(linelist$age_cat5, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 
##  1172  1130   995   895   649   441   294   179    86    46    21     4     1 
## 65-69 70-74 75-79 80-84   85+  <NA> 
##     0     0     0     0     0    92

apyramid package

The package apyramid allows you to quickly make an age pyramid. For more nuanced situations, see the tab on using ggplot() to make age pyramids. You can read more about the apyramid package in its Help page by entering ?age_pyramid in your R console.

age_pyramid() with linelist data

Using the cleaned linelist dataset, we can create an age pyramid with just one simple command. If you need help cleaning your data, see the handbook page on Cleaning data (LINK). In this command:

  • The data argument is set as the linelist dataframe
  • The age_group argument is set to the name (in quotes) of the numeric category variable (in this case age_cat5)
  • The split_by argument (bar colors) should be a binary column (in this case “gender”)
apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender")

When using agepyramid package, if the split_by column is binary (e.g. male/female, or yes/no), then the result will appear as a pyramid. However if there are more than two values in the split_by column (not including NA), the pyramid will appears as a faceted barplot with empty bars in the background indicating the range of the un-faceted data set for the age group. Values of split_by will appear as labels at top of each facet. For example below if the split_by variable is “hospital”.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "hospital",
                      na.rm = FALSE)        # show a bar for patients missing age, (note: this changes the pyramid into a faceted barplot)

Missing values
Rows missing values for the split_by or age_group columns, if coded as NA, will not trigger the faceting shown above. By default these rows will not be shown. However you can specify that they appear, in an adjacent barplot and as a separate age group at the top, by specifying na.rm = FALSE.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      na.rm = FALSE)         # show patients missing age or gender

Proportions, colors, & aesthetics

By default, the bars display counts (not %), a dashed mid-line for each group is shown, and the colors are green/purple. Each of these parameters can all be adjusted, as shown below:

You can also add additional ggplot() commands to the plot using the standard ggplot() “+” syntax, such as aesthetic themes and label adjustments:

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      proportional = TRUE,                  # show percents, not counts
                      show_midpoint = FALSE,                # remove bar mid-point line
                      #pal = c("orange", "purple")          # can specify alt. colors here (but not labels, see below)
                      )+                 
  
  # additional ggplot commands
  theme_minimal()+                                          # simplify the background
  scale_fill_manual(values = c("orange", "purple"),         # to specify colors AND labels
                     labels = c("Male", "Female"))+
  labs(y = "Percent of all cases",                          # note that x and y labels are switched (see ggplot tab)
       x = "Age categories",                          
       fill = "Gender", 
       caption = "My data source and caption here",
       title = "Title of my plot",
       subtitle = "Subtitle with \n a second line...")+
  theme(
    legend.position = "bottom",                             # move legend to bottom
    axis.text = element_text(size = 10, face = "bold"),     # fonts/sizes, see ggplot tips page
    axis.title = element_text(size = 12, face = "bold"))

age_pyramid() with aggregated data

The examples above assume your data are in a linelist-like format, with one row per observation. If your data are already aggregated into counts by age category, you can still use the apyramid package, as shown below.

Let’s say that your dataset looks like this, with columns for age category, and male counts, female counts, and missing counts.
(see the handbook page on Transforming data for tips)

# View the aggregated data
DT::datatable(demo_agg, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

ggplot() perfers data in “long” format, so first pivot the data to be “long” with the pivot_longer() function from dplyr.

# pivot the aggregated data into long format
demo_agg_long <- demo_agg %>% 
  pivot_longer(c(f, m, missing_gender),            # cols to elongate
               names_to = "gender",                # name for new col of categories
               values_to = "counts") %>%           # name for new col of counts
  mutate(gender = na_if(gender, "missing_gender")) # convert "missing_gender" to NA
# View the aggregated data
DT::datatable(demo_agg_long, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Then use the split_by and count arguments of age_pyramid() to specify the respective columns:

apyramid::age_pyramid(data = demo_agg_long,
                      age_group = "age_cat5",
                      split_by = "gender",
                      count = "counts")      # give the column name for the aggregated counts

Note in the above, that the factor order of “m” and “f” is different (pyramid reversed). To adjust the order you must re-define gender in the aggredated data as a Factor and order the levels as desired.

ggplot()

Using ggplot() to build your age pyramid allows for more flexibility, but requires more effort and understanding of how ggplot() works. It is also easier to accidentally make mistakes.

apyramid uses ggplot() in the background (and accepts ggplot() commands added), but this page shows how to adjust or recreate a pyramid only using ggplot(), if you wish.

Constructing the plot

First, understand that to make such a pyramid using ggplot() the approach is to:

  • Within the ggplot(), create two graphs by age category. Create one for each of the two grouping values (in this case gender). See filters applied to the data arguments in each geom_histogram() commands below.

  • If using geom_histogram(), the graphs operate off the numeric column (e.g. age_years), whereas if using geom_barplot() the graphs operate from an ordered Factor (e.g. age_cat5).

  • One graph will have positive count values, while the other will have its counts converted to negative values - this allows both graphs to be seen and compared against each other in the same plot.

  • The command coord_flip() switches the X and Y axes, resulting in the graphs turning vertical and creating the pyramid.

  • Lastly, the counts-axis labels must be specified so they appear as “positive” counts on both sides of the pyramid (despite the underlying values on one side being negative).

A simple version of this, using geom_histogram(), is below:

  # begin ggplot
  ggplot(data = linelist, aes(x = age, fill = gender)) +
  
  # female histogram
  geom_histogram(data = filter(linelist, gender == "f"),
                 breaks = seq(0,85,5),
                 colour = "white") +
  
  # male histogram (values converted to negative)
  geom_histogram(data = filter(linelist, gender == "m"),
                 breaks = seq(0,85,5),
                 aes(y=..count..*(-1)),
                 colour = "white") +
  
  # flip the X and Y axes
  coord_flip() +
  
  # adjust counts-axis scale
  scale_y_continuous(limits = c(-600, 900),
                     breaks = seq(-600,900,100),
                     labels = abs(seq(-600, 900, 100)))

DANGER: If the limits of your counts axis are set too low, and a counts bar exceeds them, the bar will disappear entirely or be artificially shortened! Watch for this if analyzing data which is routinely updated. Prevent it by having your count-axis limits auto-adjust to your data, as below.

There are many things you can change/add to this simple version, including:

  • Auto adjust counts-axis count scale to your data (avoid errors discussed in warning below)
  • Manually specify colors and legend labels
# create dataset with proportion of total
pyramid_data <- linelist %>%
  group_by(age_cat5, gender) %>% 
  summarize(counts = n()) %>% 
  ungroup() %>% 
  mutate(percent = round(100*(counts / sum(counts, na.rm=T)),1), 
         percent = case_when(
            gender == "f" ~ percent,
            gender == "m" ~ -percent,
            TRUE          ~ NA_real_))

max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)


# begin ggplot
  ggplot()+  # default x-axis is age in years;

  # case data graph
  geom_bar(data = pyramid_data,
           stat = "identity",
           aes(x = age_cat5,
               y = percent,
               fill = gender),        # 
           colour = "white")+         # white around each bar
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  

  # adjust the axes scales (remember they are flipped now!)
  #scale_x_continuous(breaks = seq(0,100,5), labels = seq(0,100,5)) +
  scale_y_continuous(limits = c(min_per, max_per),
                     breaks = seq(floor(min_per), ceiling(max_per), 2),
                     labels = paste0(abs(seq(floor(min_per), ceiling(max_per), 2)), "%"))+

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",
               "m" = "darkgreen"),
    labels = c("Female", "Male"),
  ) +
  
  # label values (remember X and Y flipped now)
  labs(
    x = "Age group",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Data are from linelist \nn = {nrow(linelist)} (age or sex missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases) \nData as of: {format(Sys.Date(), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0.5), 
    plot.caption = element_text(hjust=0, size=11, face = "italic")) + 
  
  ggtitle(paste0("Age and gender of cases"))

Compare to expected demographics

With the flexibility of ggplot(), you can have a second layer of bars in the background that represent the true population pyramid. This can provide a nice visualization to compare the observed counts with the baseline.

Import and view the population data

# import the population demographics data
pop <- rio::import("country_demographics.csv")
# display the linelist data as a table
DT::datatable(pop, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )

First some data management steps:

Here we record the order of age categories that we want to appear. Due to some quirks the way the ggplot() is implemented, it is easiest to store these as a character vector and use them later in the plotting function.

# record correct age cat levels
age_levels <- c("0-4","5-9", "10-14", "15-19", "20-24",
                "25-29","30-34", "35-39", "40-44", "45-49",
                "50-54", "55-59", "60-64", "65-69", "70-74",
                "75-79", "80-84", "85+")

Combine the population and case data through the dplyr function bind_rows():

  • First, ensure they have the exact same column names, age categories values, and gender values
  • Make them have the same data structure: columns of age category, gender, counts, and percent of total
  • Bind them together, one on-top of the other (bind_rows())
# create/transform populaton data, with percent of total
########################################################
pop_data <- pivot_longer(pop, c(m, f), names_to = "gender", values_to = "counts") %>% # pivot gender columns longer
  mutate(data = "population",                                                         # add column designating data source
         percent  = round(100*(counts / sum(counts, na.rm=T)),1),                     # calculate % of total
         percent  = case_when(                                                        # if male, convert % to negative
                            gender == "f" ~ percent,
                            gender == "m" ~ -percent,
                            TRUE          ~ NA_real_))

Review the changed population dataset

# display the linelist data as a table
DT::datatable(pop_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Now implement the same for the case linelist. Slightly different because it begins with case-rows, not counts.

# create case data by age/gender, with percent of total
#######################################################
case_data <- linelist %>%
  group_by(age_cat5, gender) %>%  # aggregate linelist cases into age-gender groups
  summarize(counts = n()) %>%     # calculate counts per age-gender group
  ungroup() %>% 
  mutate(data = "cases",                                          # add column designating data source
         percent = round(100*(counts / sum(counts, na.rm=T)),1),  # calculate % of total for age-gender groups
         percent = case_when(                                     # convert % to negative if male
            gender == "f" ~ percent,
            gender == "m" ~ -percent,
            TRUE          ~ NA_real_))

Review the changed case dataset

# display the linelist data as a table
DT::datatable(case_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Now the two datasets are combined, one on top of the other (same column names)

# combine case and population data (same column names, age_cat values, and gender values)
pyramid_data <- bind_rows(case_data, pop_data)

Store the maximum and minimum percent values, used in the plotting funtion to define the extent of the plot (and not cut off any bars!)

# Define extent of percent axis, used for plot limits
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)

Now the plot is made with ggplot():

  • One bar graph of population data (wider, more transparent bars)
  • One bar graph of case data (small, more solid bars)

# begin ggplot
##############
ggplot()+  # default x-axis is age in years;

  # population data graph
  geom_bar(data = filter(pyramid_data, data == "population"),
           stat = "identity",
           aes(x = age_cat5,
               y = percent,
               fill = gender),        
           colour = "black",                               # black color around bars
           alpha = 0.2,                                    # more transparent
           width = 1)+                                     # full width
  
  # case data graph
  geom_bar(data = filter(pyramid_data, data == "cases"), 
           stat = "identity",                              # use % as given in data, not counting rows
           aes(x = age_cat5,                               # age categories as original X axis
               y = percent,                                # % as original Y-axis
               fill = gender),                             # fill of bars by gender
           colour = "black",                               # black color around bars
           alpha = 1,                                      # not transparent 
           width = 0.3)+                                   # half width
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  
  # adjust axes order, scale, and labels (remember X and Y axes are flipped now)
  # manually ensure that age-axis is ordered correctly
  scale_x_discrete(limits = age_levels)+ 
  
  # set percent-axis 
  scale_y_continuous(limits = c(min_per, max_per),                                          # min and max defined above
                     breaks = seq(floor(min_per), ceiling(max_per), by = 2),                # from min% to max% by 2 
                     labels = paste0(                                                       # for the labels, paste together... 
                       abs(seq(floor(min_per), ceiling(max_per), by = 2)),                  # ...rounded absolute values of breaks... 
                       "%"))+                                                               # ... with "%"
                                                                                            # floor(), ceiling() round down and up 

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",         # assign colors to values in the data
               "m" = "darkgreen"),
    labels = c("f" = "Female",
               "m"= "Male"),      # change labels that appear in legend, note order
  ) +

  # plot labels, titles, caption    
  labs(
    title = "Case age and gender distribution,\nas compared to baseline population",
    subtitle = "",
    x = "Age category",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Cases shown on top of country demographic baseline\nCase data are from linelist, n = {nrow(linelist)}\nAge or gender missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases\nCase data as of: {format(max(linelist$date_onset, na.rm=T), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    legend.position = "bottom",                             # move legend to bottom
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0), 
    plot.caption = element_text(hjust=0, size=11, face = "italic"))

Likert Scale

The techniques used to make a population pyramid with ggplot() can also be used to make plots of Likert-scale survey data.

Import the data

# import the likert survey response data
likert_data <- rio::import("likert_data.csv")

Start with data that looks like this, with a categorical classification of each respondent (status) and their answers to 8 questions on a 4-point Likert-type scale (“Very poor”, “Poor”, “Good”, “Very good”).

# display the linelist data as a table
DT::datatable(likert_data, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )

First, some data management steps:

  • Pivot the data longer
  • Create new column direction depending on whether response was generally “positive” or “negative”
  • Set the Factor level order for the status column and the Response column
  • Store the max count value so limits of plot are appropriate
melted <- pivot_longer(likert_data, Q1:Q8, names_to = "Question", values_to = "Response") %>% 
     mutate(direction = case_when(
               Response %in% c("Poor","Very Poor") ~ "Negative",
               Response %in% c("Good", "Very Good") ~ "Positive",
               TRUE ~ "Unknown"),
            status = factor(status, levels = rev(c(
                 "Senior", "Intermediate", "Junior"))),
            Response = factor(Response, levels = c("Very Good", "Good",
                                             "Very Poor", "Poor"))) # must reverse Very Poor and Poor for ordering to work

melted_max <- melted %>% 
   group_by(status, Question) %>% 
   summarize(n = n())

melted_max <- max(melted_max$n, na.rm=T)

Now make the plot:

# make plot
ggplot()+
     # bar graph of the "negative" responses 
     geom_bar(data = filter(melted,
                            direction == "Negative"), 
              aes(x = status,
                        y=..count..*(-1),    # counts inverted to negative
                        fill = Response),
                    color = "black",
                    closed = "left", 
                    position = "stack")+
     
     # bar graph of the "positive responses
     geom_bar(data = filter(melted, direction == "Positive"),
              aes(x = status, fill = Response),
              colour = "black",
              closed = "left",
              position = "stack")+
     
     # flip the X and Y axes
     coord_flip()+
  
     # Black vertical line at 0
     geom_hline(yintercept = 0, color = "black", size=1)+
     
    # convert labels to all positive numbers
    scale_y_continuous(limits = c(-ceiling(melted_max/10)*11, ceiling(melted_max/10)*10),   # seq from neg to pos by 10, edges rounded outward to nearest 5
                       breaks = seq(-ceiling(melted_max/10)*10, ceiling(melted_max/10)*10, 10),
                       labels = abs(unique(c(seq(-ceiling(melted_max/10)*10, 0, 10),
                                            seq(0, ceiling(melted_max/10)*10, 10))))) +
     
    # color scales manually assigned 
    scale_fill_manual(values = c("Very Good"  = "green4", # assigns colors
                                  "Good"      = "green3",
                                  "Poor"      = "yellow",
                                  "Very Poor" = "red3"),
                       breaks = c("Very Good", "Good", "Poor", "Very Poor"))+ # orders the legend
     
    
     
    # facet the entire plot so each question is a sub-plot
    facet_wrap(~Question, ncol = 3)+
     
    # labels, titles, caption
    labs(x = "Respondent status",
          y = "Number of responses",
          fill = "")+
     ggtitle(str_glue("Likert-style responses\nn = {nrow(likert_data)}"))+

     # aesthetic settings
     theme_minimal()+
     theme(axis.text = element_text(size = 12),
           axis.title = element_text(size = 14, face = "bold"),
           strip.text = element_text(size = 14, face = "bold"),  # facet sub-titles
           plot.title = element_text(size = 20, face = "bold"),
           panel.background = element_rect(fill = NA, color = "black")) # black box around each facet

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Heatmaps

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Nice to use when tracking metrics at many facilities/regions over time

For example:

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Tracking performance

DT::datatable(facility_count_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))

Group the data by week and location, and then make summary calculations:

# Get number & percent of days data reported for, by week
agg_weeks <- facility_count_data %>% 
  mutate(week = as.Date(aweek::week2date(aweek::date2week(data_date, floor_day = T)))) %>% 
  group_by(location_name, week) %>% 
  summarize(n_days = 7,
            n_reports = n(),
            malaria_tot = sum(malaria_tot, na.rm = T),
            n_days_reported = length(unique(data_date)),
            p_days_reported = round(100*(n_days_reported / n_days))) %>% 
  filter(week < as.Date("2019-06-10"))

### Days
agg_days <- facility_count_data %>% 
  
  filter(data_date < as.Date("2019-06-10"))
  

Then we make the plot:

ggplot(agg_weeks, aes(x=week, y=location_name, fill= p_days_reported))+
  geom_tile(colour="white",size=0.2)+
  guides(fill=guide_legend(title="Reporting\nperformance (%)"))+
  labs(x="Week (date of data)",
       y="Facility name",
       title="Percent of days per week that facility reported data",
       subtitle = "52 health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")+
  scale_fill_gradient(low = "yellow", high = "darkgreen", na.value = "grey80")+
  theme_light()+
  theme(legend.position="right",legend.direction="vertical",
        legend.title=element_text(size=12, face="bold"),
        legend.margin=margin(grid::unit(0,"cm")),
        legend.text=element_text(size=10,face="bold"),
        legend.key.height=grid::unit(0.8,"cm"),
        legend.key.width=grid::unit(0.2,"cm"),
        axis.text.x=element_text(size=12),
        axis.text.y=element_text(vjust=0.2),
        axis.ticks=element_line(size=0.4),
        axis.title=element_text(size=12, face="bold"),
        plot.background=element_blank(),
        panel.border=element_blank(),
        plot.margin=margin(0.7,0.4,0.1,0.2,"cm"),
        plot.title=element_text(hjust=0,size=14,face="bold"))

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Combination analysis

Overview

This analysis plots the frequency of different combinations of values/responses. In this example, we plot the frequency of symptom combinations.

This analysis is often called:
Multiple response analysis Sets analysis Combinations analysis

The first method shown uses the package ggupset, an the second using the package UpSetR.

An example plot is below. Five symptoms are shown. Below each vertical bar is a line and dots indicating the combination of symptoms reflected by the bar above. To the right, horizontal bars reflect the frequency of each individual symptom.

Preparation

View the data

This linelist includes five “yes/no” variables on reported symptoms. We will need to transform these variables a bit to use the ggupset package to make our plot.

View the data (scroll to the right to see the symptoms variables)

DT::datatable(linelist_sym, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))

Re-format values

We convert the “yes” and “no the the actual symptom name. If”no", we set the value as blank.

# create column with the symptoms named, separated by semicolons
linelist_sym_1 <- linelist_sym %>% 
  
  # convert the "yes" and "no" values into the symptom name itself
  mutate(fever = case_when(fever == "yes" ~ "fever",          # if old value is "yes", new value is "fever"
                           TRUE           ~ NA_character_),   # if old value is anything other than "yes", the new value is NA
         
         chills = case_when(chills == "yes" ~ "chills",
                           TRUE           ~ NA_character_),
         
         cough = case_when(cough == "yes" ~ "cough",
                           TRUE           ~ NA_character_),
         
         aches = case_when(aches == "yes" ~ "aches",
                           TRUE           ~ NA_character_),
         
         shortness_of_breath = case_when(shortness_of_breath == "yes" ~ "shortness_of_breath",
                           TRUE           ~ NA_character_))

Now we make two final variables:
1. Pasting together all the symptoms of the patient (character variable)
2. Convert the above to class list, so it can be accepted by ggupset to make the plot

linelist_sym_1 <- linelist_sym_1 %>% 
  mutate(
         # combine the variables into one, using paste() with a semicolon separating any values
         all_symptoms = paste(fever, chills, cough, aches, shortness_of_breath, sep = "; "),
         
         # make a copy of all_symptoms variable, but of class "list" (which is required to use ggupset() in next step)
         all_symptoms_list = as.list(strsplit(all_symptoms, "; "))
         )

View the new data. Note the two columns at the end - the pasted combined values, and the list

DT::datatable(linelist_sym, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))

ggupset

Load required package to make the plot (ggupset)

pacman::p_load(ggupset)

Create the plot:

ggplot(linelist_sym_1,
       aes(x=all_symptoms_list)) +
geom_bar() +
scale_x_upset(reverse = FALSE,
              n_intersections = 10,
              sets = c("fever", "chills", "cough", "aches", "shortness_of_breath")
              )+
  labs(title = "Signs & symptoms",
       subtitle = "10 most frequent combinations of signs and symptoms",
       caption = "Caption here.",
       x = "Symptom combination",
       y = "Frequency in dataset")

More information on ggupset can be found online or offline in the package documentation in your RStudio Help tab.

UpSetR

The UpSetR package allows more customization, but it more difficult to execute:

https://github.com/hms-dbmi/UpSetR read this https://gehlenborglab.shinyapps.io/upsetr/ Shiny App version - you can upload your own data https://cran.r-project.org/web/packages/UpSetR/UpSetR.pdf documentation - difficult to interpret

pacman::p_load(UpSetR)

Convert symptoms variables to 1/0.

# Make using upSetR

linelist_sym_2 <- linelist_sym %>% 
  
  # convert the "yes" and "no" values into the symptom name itself
  mutate(fever = case_when(fever == "yes" ~ 1,          # if old value is "yes", new value is "fever"
                           TRUE           ~ 0),   # if old value is anything other than "yes", the new value is NA
         
         chills = case_when(chills == "yes" ~ 1,
                           TRUE           ~ 0),
         
         cough = case_when(cough == "yes" ~ 1,
                           TRUE           ~ 0),
         
         aches = case_when(aches == "yes" ~ 1,
                           TRUE           ~ 0),
         
         shortness_of_breath = case_when(shortness_of_breath == "yes" ~ 1,
                           TRUE           ~ 0))

Now make the plot, using only the symptom variables. Must designate which “sets” to compare (the names of the symptom variables).
Alternatively use nsets = and order.by = "freq" to only show the top X combinations.


# Make the plot
UpSetR::upset(
  select(linelist_sym_2, fever, chills, cough, aches, shortness_of_breath),
  sets = c("fever", "chills", "cough", "aches", "shortness_of_breath"),
  order.by = "freq",
  sets.bar.color = c("blue", "red", "yellow", "darkgreen", "orange"), # optional colors
  empty.intersections = "on",
  # nsets = 3,
  number.angles = 0,
  point.size = 3.5,
  line.size = 2, 
  mainbar.y.label = "Symptoms Combinations",
  sets.x.label = "Patients with Symptom")

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

R Markdown

R Markdown is a fantastic tool for creating automated, reproducible, and share-worthy outputs. It can generate static or interactive outputs, in the form of html, word, pdf, powerpoint, and others.

Overview

Using markdown will allow you easily recreate an entire formatted document, including tables/figures/text, using new data (e.g. daily surveillance reports) and/or subsets of data (e.g. reports for specific geographies).

This guide will go through the basics. See ‘resources’ tab for further info.

Preparation

Preparation of an R Markdown workflow involves ensuring you have set up an R project and have a folder structure that suits the desired workflow.

For instance, you may want an ‘output’ folder for your rendered documents, an ‘input’ folder for new cleaned data files, as well as subfolders within them which are date-stamped or reflect the subgeographies of interest. The markdown itself can go in a ‘rmd’ subfolder, particularly if you have multiple Rmd files within the same project.

You can set code up to create output subfolders for you each time you run reports (see “Producing an output”), but you should have the overall design in mind.

Because R Markdown can run into pandoc issues when running on a shared network drive, it is recommended that your folder is on your local machine, e.g. in a project within ‘My Documents’. If you use Git (much recommended!), this will be familiar.

The R Markdown file

An R Markdown document looks like and can be edited just like a standard R script, in R Studio. However, it contains more than just the usual R code and hashed comments. There are three basic components:

1. Metadata: This is referred to as the ‘YAML’ and is at the top of the R Markdown document between two ‘—‘s. It will tell your Rmd file what type of output to produce, formatting preferences, and other metadata such as document title, author, and date. There are other uses not mentioned here (but referred to in ‘Producing an output’). Note that indentation matters.

2. Text: This is the narrative of your document, including the titles. It is written in the markdown language, used across many different programmes. This means you can add basic formatting, for instance:

  • _text_ or *text* to italicise
  • **text** for bold text
  • # at the start of a new line for a title (and ## for second-level title, ## for third-level title etc)
  • * at the start of a new line for bullet points

The actual appearance of the font can be set by using specific templates (specified in the YAML; see example tabs).

You can also include minimal R code within backwards ticks, for within-text values. See example below.

3. Code chunks: This is where the R code goes, for the actual data management and visualisation. To note: These ‘chunks’ will appear to have a slightly different background colour from the narrative part of the document.

Each chunk always starts with three backticks and chunk information within squiggly brackets, and ends with three more backticks.

Some notes about the content of the squiggly brackets:

  • They start with ‘r’ to indicate that the language name within the chunk is r
  • Followed by the chunk name - note this should ALWAYS be a unique name or else R will complain when you try to render.
  • It can include other options too, but many of these can be configured with point-and-click using the setting buttons at the top right of the chunk. Here, you can specify which parts of the chunk you want the rendered document to include, namely the code, the outputs, and the warnings. This will come out as written preferences within the squiggly brackets, e.g. ‘echo=FALSE’ if you specify you want to ‘Show output only’.

There are also two arrows at the top right of each chunk, which are useful to run code within a chunk, or all code in prior chunks.

Producing an output

General notes

Everything used by this markdown must be referenced within the Rmd file. For instance, you need to load any required packages or data.

A single or test run from within R Markdown

To render a single document, for instance if you are testing it or if you only need to produce one rendered document at a time, you can do it from within the open R Markdown file. Click the button at the top of the document.

The ‘R Markdown’ tab will start processing to show you the overall progress, and a complete document will automatically open when complete. This document will also be saved in the same folder as your markdown, and with the same file name aside from the file extension. This is obviously not ideal for version control, as you will then rename the file yourself.

A single run from an separate script

To run the markdown so that a date-stamped file is produced, you can create a separate script and call the Rmd file from within it. You can also specify the folder and file name, and include a dynamic date and time, so that file will be date stamped on production.

rmarkdown::render(("rmd_reports/create_RED_report.Rmd"),  
                        output_file = paste0("outputs/Report_", Sys.Date, ".docx")) # Use 'paste0' to combine text and code for a dynamic file name

Routine runs into newly created date-stamped sub folders

Add a couple lines of code to define the date you are running the report (e.g. using Sys.Date as in the example above) and create new sub folders. If you want the date to reflect a specific date rather than the current date, you can also enter it as an object.

# Set the date of report
refdate <- as.Date("2020-12-21")

# Create the folders
outputfolder <- paste0("outputs/", refdate) # This is the new folder name
dir.create(outputfolder) # Creates the folder (in this case assumed 'outputs' already exists)

#Run the loop
rmarkdown::render(("rmd_reports/create_report.Rmd"),  
                        output_file = paste0(outputfolder, "/Report_", refdate, ".docx")) #Dyanmic folder name now included

You may want some dynamic information to be included in the markdown itself. This is addressed in the next section.

Parametrised reports

Parameterised reports are the next step so that the content of the R Markdown itself can also be dynamic. For example, the title can change according to the subgeography you are running, and the data can filter to that subgeography of interest.

Let’s say you want to run the markdown to produce a report with surveillance data for Area1 and Area2. You will:

  1. Edit your R Markdown:
  1. Change your YAML to include a ‘params’ section, which specifies the dynamic object.
  2. Refer to this parameterised object within the code as needed. E.g. filter(area == params$areanumber) rather than filter(area=="Area1").

For instance (simplified version which does not include setup code such as library/data loading):

You can change the content by editing the YAML as needed, or set up a loop in a separate script to iterate through the areas. As with the previous section, you can set up the folders as well.

As you can see below, you set up a list which includes all areas of interest (arealist), and when rendering the markdown you specify that the parameterized areanumber for a specific iteration is the Nth value of the arealist. For instance, for the first iteration, areanumber will equate to “Area1”. The code below also specifies that the Nth area name will be included in the output file name.

Note that this will work even if an area or date are specified within the YAML itself - that YAML information will get overwritten by the loop.

# Set the date of report
refdate <- as.Date("2020-12-21")

# Set the list (note that this can also be an imported list)
arealist <- c("Area1", "Area2", "Area3", "Area4", "Area5")

# Create the folders
outputfolder <- paste0("outputs/", refdate) # This is the new folder name
dir.create(outputfolder) # Creates the folder (in this case assumed 'outputs' already exists)

#Run the loop

for(i in 1:length(arealist))  { # This will loop through from the first value to the last value in 'arealist'

rmarkdown::render(here("rmd_reports/create_report.Rmd"), 
                        params = list(areanumber = arealist[1], #Assigns the nth value of arealist to the current areanumber
                                      refdate = refdate),
                        output_file = paste0(outputfolder, "/Report_", arealist[1], refdate, ".docx")) 
                        
}

GIS basics

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Plotting coordinates

polygons and shapefiles

Simple analyses

Distance to nearest X (HCF)

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

if/else and for loops

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Modeling

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

Overview

Tidymodels

Logistic Regression

Multi-level modeling Regression

Survival analysis

Multi-stage Markov models

Liza Coyer TODO this? logitudinal data

Tables of model results

Causal diagrams

Epidemic modeling

Overview

R(t) estimations Doubling times Projections

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

R projects

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Advanced RStudio

Overview

rprofiles

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

ggplot tips

Overview

Embed ggplot cheatsheet

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Highlighting

highlighting one line among many etc gghighlight

Dual axes

Cowplot Complicated method (% 100 * …)

Smart Labeling

ggrepel

Time axes

Dual axes

Adding shapes

Animations

R with interactive console

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

sapply/mapply/lapply

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Shiny basics

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

flexdashboard

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Writing functions

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Relational databases

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

R on network drives

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Troubleshooting tips, common errors, etc.

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Report factory

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Github and R

The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}

Overview

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Interactive plots

Overview

PLOTLY

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Directory interactions

Overview

Saving files, deleting files, creating folders, interacting with files in a folder, etc Overwriting files in Excel

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.

Errors & Warnings

Overview

Troubleshooting common errors and warnings

Keep the title of this section as “Overview”.
This tab should include:

  • Textual overview of the purpose of this page
  • Small image showing outputs

Preparation

Keep the title of this section as “Preparation”.
Data preparation steps such as:

  • Loading dataset
  • Adding or changing variables
  • melting, pivoting, grouping, etc.

sub-tab 1

Can be used to separate major steps of data preparation. Re-name as needed

#Tried to add a value ("Missing") to a factor (with replace_na operating on a factor)
Problem with `mutate()` input `age_cat`.
i invalid factor level, NA generated
i Input `age_cat` is `replace_na(age_cat, "Missing")`.invalid factor level, NA generated
# ran recode without re-stating the x variable in mutate(x = recode(x, OLD = NEW)
Error: Problem with `mutate()` input `hospital`.
x argument ".x" is missing, with no default
i Input `hospital` is `recode(...)`.

Error: Insufficient values in manual scale. 3 needed but only 2 provided. ggplot() scale_fill_manual() values = c(“orange”, “purple”) … insufficient for number of factor levels … consider whether NA is now a factor level…

sub-tab 2

Can be used to separate major steps of data preparation. Re-name as needed.

Option 1

This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.

Option 1 sub-tab

Sub-tabs if necessary. Re-name as needed.

Option 2

This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.

Option 2 sub-tab

Sub-tabs if necessary. Re-name as needed.

Resources

This tab should stay with the name “Resources”. Links to other online tutorials or resources.